Metagenomic example

Step 1: Import datasets (454 sequencing reads and corresponding quality scores) into your current history:

Step 2: Use the "NGS: QC and manipulation > Select high quality segments" tool to extract high quality regions from these reads. Change the value of "Minimal length of contiguous segment" to 50. Default values should be sufficient for other parameters. This will extract contiguous high quality segments from the reads.
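
To make the idea of a "contiguous high quality segment" concrete, here is a minimal Python sketch of what such an extraction does for a single read. It is not the Galaxy tool's implementation: the function name, the quality cutoff of 20, and the input layout (one integer quality score per base) are assumptions made for illustration. Only the minimal segment length of 50 comes from this step.

```python
def high_quality_segments(seq, quals, min_quality=20, min_length=50):
    """Return contiguous stretches of `seq` whose per-base quality is
    >= min_quality and whose length is >= min_length.

    `seq` is a string of bases and `quals` a list of integer scores of
    the same length. min_quality=20 is an assumed cutoff.
    """
    segments = []
    start = None  # index where the current high-quality run began
    for i, q in enumerate(quals):
        if q >= min_quality:
            if start is None:
                start = i
        else:
            # A low-quality base ends the current run, if there is one.
            if start is not None and i - start >= min_length:
                segments.append(seq[start:i])
            start = None
    # Handle a run that extends to the end of the read.
    if start is not None and len(quals) - start >= min_length:
        segments.append(seq[start:])
    return segments
```

A read with 60 good bases, 5 poor bases, and 30 good bases yields a single segment of length 60, because the trailing 30-base run is shorter than the minimal length.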

Step 3: The default read names are unwieldy and not handled well by Megablast. We will rename the reads with a numeric index. Use the "NGS: QC and manipulation > Rename sequences" tool and select "Rename sequences to" "numeric counter".
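
The renaming itself is simple. The sketch below shows the idea on FASTA lines held in a Python list. The function name is hypothetical, and starting the counter at 1 is an assumption; the Galaxy tool may number reads differently.

```python
def rename_with_counter(fasta_lines):
    """Replace every FASTA header with a running numeric index.

    `fasta_lines` is a list of lines without trailing newlines.
    Sequence lines are passed through unchanged.
    """
    renamed = []
    counter = 0
    for line in fasta_lines:
        if line.startswith(">"):
            counter += 1  # assumed to start at 1
            renamed.append(f">{counter}")
        else:
            renamed.append(line)
    return renamed
```

Because the original names are discarded, keep the unrenamed dataset in your history if you need to trace a read back to its source.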

Step 4: Use "NGS: Mapping > Megablast" to map reads to the wgs database. Set the identity threshold to 80% and the E-value cutoff to 0.0001. Inspect the resulting alignments. 

Step 5: Filter alignments to include only those with query coverage greater than 0.5. First we will use "FASTA Manipulation > Compute Sequence Length" to find the length of each query sequence (run this on the high quality segments). Next, we want to join the Megablast results with the sequence lengths. Both datasets have sequence identifiers in column 1. Use "Join, Subtract and Group > Join two Queries" to put them together. Finally, filter this dataset using "Filter and sort > Filter" to keep only lines with sufficient alignment coverage (use the condition c5/c15 > 0.5).
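
The filter condition can be sketched in Python as follows. This assumes, based on the expression c5/c15 in this step, that column 5 of the joined dataset holds the alignment length and column 15 holds the query sequence length; the function name and the tab-separated input layout are also assumptions for illustration.

```python
def filter_by_coverage(rows, aln_col=5, len_col=15, min_coverage=0.5):
    """Keep tab-separated rows whose alignment covers more than
    `min_coverage` of the query sequence.

    Column numbers are 1-based, matching the c5/c15 notation.
    Assumed meaning: aln_col = alignment length, len_col = query length.
    """
    kept = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        aln_len = float(fields[aln_col - 1])
        query_len = float(fields[len_col - 1])
        # Strictly greater than, as in the condition c5/c15 > 0.5.
        if query_len > 0 and aln_len / query_len > min_coverage:
            kept.append(row)
    return kept
```

Note that a row with coverage of exactly 0.5 is dropped, since the comparison is strict.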

Step 6: Now, map sequences back to their taxonomic position. Each Megablast result contains a unique GI identifier in the second column. Use the "Metagenomic Analysis > Fetch Taxonomic Representation" tool to map these reads to their taxonomic representation.

Step 7: Use the "Metagenomic Analysis > Find lowest diagnostic ranks" tool to find reads that are diagnostic for a classification level below Kingdom.
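
One way to think about a "lowest diagnostic rank" is as the deepest rank at which all hits for a read still agree. The Python sketch below implements that reading. It is an interpretation, not the Galaxy tool's code: the rank list is simplified, the function name is hypothetical, and stopping at the first missing or conflicting rank is an assumption.

```python
# Simplified rank list, ordered from highest to lowest (an assumption).
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species"]

def lowest_diagnostic_rank(lineages):
    """Return (rank, taxon) for the lowest rank shared by all hits.

    `lineages` is a list of dicts, one per hit of a single read,
    mapping rank name -> taxon name. Returns None if the hits
    disagree (or lack a value) already at the highest rank.
    """
    lowest = None
    for rank in RANKS:
        taxa = {lineage.get(rank) for lineage in lineages}
        if len(taxa) == 1 and None not in taxa:
            lowest = (rank, taxa.pop())
        else:
            break  # hits diverge here; deeper ranks are not diagnostic
    return lowest
```

Under this reading, a read that hits two different species of the same genus is diagnostic for the genus but not for either species.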

Step 8: Use "Metagenomic Analysis > Summarize Taxonomy" to aggregate over all the diagnostic reads to get a summary of the number of reads diagnostic for each clade.
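
Conceptually, this aggregation is a count of reads per clade. A minimal Python sketch, assuming each diagnostic read has already been reduced to a (read_id, rank, taxon) triple; that layout and the function name are hypothetical, and the Galaxy tool's actual output format may differ:

```python
from collections import Counter

def summarize_taxonomy(diagnostic_reads):
    """Count diagnostic reads per (rank, taxon) pair.

    `diagnostic_reads` is an iterable of (read_id, rank, taxon)
    triples (an assumed layout for illustration).
    """
    return Counter((rank, taxon) for _read_id, rank, taxon in diagnostic_reads)
```

The resulting counts are the kind of per-node numbers that a taxonomic tree can be annotated with.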

Step 9: Use "Metagenomic Analysis > Draw Phylogeny" to draw a taxonomic tree with each node annotated with the number of diagnostic hits for that node.

On your own...

Can you extract a workflow, edit it to make the Megablast database a parameter, and rerun it against the nt database?

Can you acquire a different metagenomic dataset (perhaps from a public database) and run the same analysis on it?