This document is a live copy of supplementary materials for the manuscript. It provides access to the exact analyses and workflows discussed in the paper, so you can play with them by re-running, changing parameters, or even applying them to your own data. Specifically, we provide the two histories and one workflow found below. You can view these items by clicking on their name to expand them. You can also import these items into your Galaxy workspace and start using them; click on the green plus to import an item. To import workflows you must create a Galaxy account (unless you already have one) – a hassle-free procedure where you are only asked for a username and password.
This is the Galaxy history detailing the comparison of our pipeline to MEGAN:
This is the Galaxy history showing a generic analysis of metagenomic data. (This corresponds to the "A complete metagenomic pipeline" section of the manuscript and Figure 3A):
This is the Galaxy workflow for generic analysis of metagenomic data. (This corresponds to the "A complete metagenomic pipeline" section of the manuscript and Figure 3B):
Windshield Splatter datasets analyzed in this manuscript can be accessed through this Galaxy Library. From there they can be re-analyzed through Galaxy using the above workflows or downloaded.
(Use this link to see the Galaxy history representing this analysis. Individual elements of this history are referred to as History Item 1, 2, and so on, in bold typeface.)
The first step of a homology-based metagenomic analysis is to compare a collection of sequencing reads against a database whose entries are assigned to taxonomic ranks. Following the procedure of (Huson et al. 2007) we used the non-redundant protein database (NR) from the National Center for Biotechnology Information. There are several ways to import large sets of alignments into Galaxy. First, alignments can be generated directly within Galaxy (see the following section). Alternatively, alignments generated elsewhere (e.g., using local BLAST installations or web-based resources such as CAMERA (Seshadri et al. 2007); see below) can be uploaded in either tab-delimited or XML format. To demonstrate this functionality, we generated alignments in BLAST XML format outside of Galaxy using the BLASTx program of the BLAST package (Altschul et al. 1990) and then uploaded them into Galaxy’s history. Galaxy includes a parser for XML generated by BLAST programs that produces a tab-delimited format that can be easily used in downstream analyses. Only 243 reads from sample 1 (~2%; 3,812,372 alignments) and 1,192 reads from samples 2-4 (~11%; 3,581,932 alignments) did not produce matches against the NR database (History Items 1 and 2). These counts are slightly higher than those reported by Huson et al. because we set the BLAST E-value flag (-e) to 0.01 instead of the default value of 10 used in (Huson et al. 2007), which removed many weakly supported alignments and substantially decreased the size of the resulting file.
Similarly to Huson and colleagues, we further filtered the BLAST alignments by retaining only those hits that were within 5% of the best score for every read, using a combination of Galaxy tools (History Items 3-8). We first selected the lines with the highest bit score per read (History Items 3 and 4). Next, we joined these lines with the original files using the join tool (History Items 5 and 6). Finally, we selected those lines from datasets 5 and 6 where the bit score was within 5% of the maximum (History Items 7 and 8). This greatly reduced the number of hits, to 54,458 and 62,647 in samples 1 and 2-4, respectively, although the number of reads producing these hits did not change (9,757 and 8,808 reads, respectively).
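The three filtering steps described above (best score per read, join, select) can be sketched outside Galaxy as follows. This is a minimal illustration rather than the Galaxy tools themselves: the tuple layout, the function name `filter_near_best`, and the reading of "within 5%" as "bit score at least 95% of the read's maximum" are all assumptions made for the example.

```python
from collections import defaultdict


def filter_near_best(hits, tolerance=0.05):
    """Keep hits whose bit score is within `tolerance` of their read's best score.

    `hits` is a list of (read_id, subject_id, bit_score) tuples; this layout is
    an assumption for the example, not Galaxy's actual column order.
    """
    # Highest bit score per read (analogous to History Items 3 and 4).
    best = defaultdict(float)
    for read_id, _subject, score in hits:
        best[read_id] = max(best[read_id], score)

    # Pair every hit with its read's maximum and apply the cutoff
    # (analogous to the join and select steps, History Items 5-8).
    return [
        (read_id, subject, score)
        for read_id, subject, score in hits
        if score >= (1.0 - tolerance) * best[read_id]
    ]


# Hypothetical hits: read1 has three alignments, read2 has one.
example_hits = [
    ("read1", "protA", 100.0),
    ("read1", "protB", 96.0),
    ("read1", "protC", 90.0),
    ("read2", "protD", 50.0),
]
kept = filter_near_best(example_hits)
print(kept)  # protC is dropped; every read still has at least one hit
```

Because the best hit of each read always passes the cutoff, this kind of filter lowers the number of hits without changing the number of reads, which is consistent with the counts reported above.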
Because every entry within the NR database is assigned a taxonomy id, it is straightforward to create a phylogenetic profile of every read that aligns against a database sequence. Galaxy features the Fetch Taxonomic Ranks tool that quickly parses the NCBI taxonomy and writes out a taxonomic string consisting of 21 taxonomic ranks from superkingdom to subspecies. Application of this tool to the filtered BLAST hits produced 54,458 and 62,647 taxonomic strings for samples 1 and 2-4, respectively (History Items 9 and 10). Note that because the numbers of taxonomic strings greatly surpass the numbers of sequencing reads (9,757 and 8,808, respectively), each read is likely represented by multiple phylogenetic profiles. As a result, all reads can be divided into two categories: diagnostic and non-specific. A diagnostic read consistently hits database sequences belonging to the same taxonomic group, while its non-specific counterpart identifies with multiple taxa. (An extreme example of a non-specific read will produce alignments with both eukaryotic and prokaryotic sequences and as a result will be useless for phylogenetic profiling of metagenomic samples.) Furthermore, as biological classification is hierarchical, a read can be diagnostic at one level and non-specific at another: if a given read produces alignments with multiple database sequences yet all these sequences belong to the same genus, we consider such a read diagnostic for that genus. It is easy to envision a situation in which a read diagnostic for a genus hits multiple species within that genus. For instance, a read producing 10 alignments all within the genus Drosophila may, at the species level, align with sequences from D. melanogaster and D. ananassae. Thus such a read is diagnostic at the genus level but non-specific at the species level.
In addition, even when a read represents species A, it will likely also produce alignments with a closely related species B and therefore will appear non-specific at the species level. There are two ways to address this situation. First, one can tabulate a list of reads diagnostic at a predefined taxonomic level. In Galaxy this is achieved with the “Find diagnostic hits” tool (Table 1), within which the user specifies the desired taxonomic ranks and the tool returns the reads diagnostic for those ranks. Alternatively, one can traverse the taxonomic strings of every read by identifying and removing reads with more than one taxonomic label (see the explanation of the tool’s algorithm at the Galaxy web site under “Metagenomic Tools” - “Find lowest taxonomic rank”). This approach is conceptually identical to the Lowest Common Ancestor (LCA) algorithm of (Huson et al. 2007) and is implemented in the Find lowest diagnostic rank tool. We used this tool here to directly compare our implementation to the results produced by the MEGAN software. For samples 1 and 2-4 we identified 9,380 and 7,847 reads, respectively, that were diagnostic below the Kingdom level (History Items 11 and 12). These numbers are slightly higher than those reported by Huson et al. One reason for this is that these authors used a version of the NR database that is roughly two years older than the one used in this study. Finally, we visualized the results of our analysis using the Draw phylogeny tool, which renders phylogenetic trees using taxonomy datasets as input (History Items 13 and 14). Figure 1 shows the portion of the tree for Gammaproteobacteria in the two samples. The resulting topology and read numbers are nearly identical to those produced by MEGAN with this dataset (see Figures 3C and 3D in (Huson et al. 2007)), suggesting that our approach works correctly.
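The idea of a lowest diagnostic rank can be illustrated with a short sketch. It is not the code of the Galaxy tool: the function name `lowest_diagnostic_rank`, the abbreviated seven-rank list (the actual taxonomic strings have 21 ranks), and the example lineages are assumptions chosen to mirror the Drosophila example above.

```python
# Abbreviated rank list, assumed for illustration only.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lowest_diagnostic_rank(profiles):
    """Return (rank, taxon) for the deepest rank shared by all profiles of a read.

    `profiles` is a list of taxonomic strings, one per hit, each given as a
    list of labels ordered like RANKS. Returns None if the hits already
    disagree at the top rank (a fully non-specific read).
    """
    result = None
    for index, rank in enumerate(RANKS):
        labels = {profile[index] for profile in profiles}
        if len(labels) != 1:
            break  # hits disagree here, so the read is non-specific from this rank down
        result = (rank, labels.pop())
    return result


melanogaster = ["Eukaryota", "Arthropoda", "Insecta", "Diptera",
                "Drosophilidae", "Drosophila", "Drosophila melanogaster"]
ananassae = ["Eukaryota", "Arthropoda", "Insecta", "Diptera",
             "Drosophilidae", "Drosophila", "Drosophila ananassae"]
e_coli = ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
          "Enterobacteriaceae", "Escherichia", "Escherichia coli"]

print(lowest_diagnostic_rank([melanogaster, ananassae]))  # ('genus', 'Drosophila')
print(lowest_diagnostic_rank([melanogaster, e_coli]))     # None
```

A read hitting two Drosophila species is reported as diagnostic at the genus level, while a read hitting both a eukaryote and a prokaryote has no diagnostic rank at all.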
Supplementary Figure S1. Genus-level phylogenetic profile of class Gammaproteobacteria reconstructed from protein-level comparisons. The color of the branches represents the relative abundance of sequencing reads representing that branch (red = more; blue = less). Numbers within each box signify the number of sequencing reads associated with a given taxon. Branches without labels identify reads that do not identify with any ranks above genus level (in this case unidentified uncultured gammaproteobacterium).
Supplementary Figure S2. Analysis of Sargasso Sea metagenomic reads from Samples 1 (A) and 2-4 (B) as described in (Huson et al. 2007). Read length = distribution of read lengths. Alignment length = distribution of lengths of megaBLAST hits produced by aligning the reads against NT and WGS databases. Alignable fraction = distribution of the proportion of each read’s length covered by megaBLAST hits against NT and WGS databases. Q1, Q2, and Q3 = first, second (median), and third quartiles.
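One way to compute an alignable fraction of the kind plotted in Supplementary Figure S2 is sketched below. The function name `alignable_fraction`, the use of 1-based inclusive read coordinates, and the decision to count overlapping hits only once are assumptions for the example; the figure caption does not specify these details.

```python
def alignable_fraction(read_length, hit_spans):
    """Fraction of a read covered by at least one hit.

    `hit_spans` is a list of (start, end) positions on the read, 1-based and
    inclusive (an assumed convention). Overlapping spans are counted once.
    """
    covered = 0
    covered_up_to = 0  # last read position already counted
    for start, end in sorted(hit_spans):
        start = max(start, covered_up_to + 1)  # skip the part already counted
        if end >= start:
            covered += end - start + 1
            covered_up_to = end
    return covered / read_length


# A hypothetical 100-base read with two overlapping hits and one separate hit.
print(alignable_fraction(100, [(1, 50), (40, 80), (90, 100)]))  # 0.91
```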
Supplementary Figure S3. Genus-level phylogenetic profile of class Gammaproteobacteria reconstructed from nucleotide-level comparisons. The color of the branches represents the relative abundance of sequencing reads representing that branch (red = more; blue = less). Numbers within each box signify the number of sequencing reads associated with a given taxon. Branches without labels identify reads that do not identify with any ranks above genus level (in this case unidentified uncultured gammaproteobacterium).
Supplementary Figure S4. Analysis of 454 read quality for trip A (A) and B (B). The distribution of base quality scores (in phred metric) for all sequencing reads in the experiment. To produce this image, each read was divided into 20 equal-sized segments and the quality scores for all bases falling within each segment were averaged. These average quality values from all reads were then used to produce the box plot.
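The per-segment averaging described for Supplementary Figure S4 can be sketched as follows. The function name `segment_means`, the integer-arithmetic segment boundaries, and the rejection of reads shorter than the number of segments are assumptions for the example, since the caption does not say how uneven lengths or very short reads were handled.

```python
def segment_means(qualities, n_segments=20):
    """Average phred scores within `n_segments` near-equal slices of one read."""
    n = len(qualities)
    if n < n_segments:
        # Assumed behaviour: such reads cannot fill every segment.
        raise ValueError("read is shorter than the number of segments")
    means = []
    for k in range(n_segments):
        # Integer boundaries keep segment sizes within one base of each other.
        start = k * n // n_segments
        end = (k + 1) * n // n_segments
        segment = qualities[start:end]
        means.append(sum(segment) / len(segment))
    return means


# A hypothetical 40-base read whose quality scores run from 0 to 39.
means = segment_means(list(range(40)))
print(len(means), means[0], means[-1])  # 20 0.5 38.5
```

Collecting the k-th value from every read gives the data for the k-th box of the box plot.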
Supplementary Figure S5. Analysis of read fragmentation by low quality bases. (A). Length distribution of fragments generated by splitting the reads on any base with quality score < 20 (phred metric). (B). Length distribution of fragments generated by splitting the reads on bases with quality score < 20 that are NOT in the proximity of homopolymer runs.
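The fragmentation in panel A of Supplementary Figure S5 amounts to measuring the runs of bases that lie between low-quality positions. The sketch below covers panel A only; the function name `fragment_lengths` is hypothetical, and the homopolymer-aware variant of panel B is not attempted because the caption does not define "proximity".

```python
def fragment_lengths(qualities, min_quality=20):
    """Lengths of the fragments left after splitting a read on low-quality bases.

    Every base with a phred score below `min_quality` acts as a break point and
    is not counted in any fragment.
    """
    lengths = []
    run = 0
    for q in qualities:
        if q >= min_quality:
            run += 1
        else:
            if run:
                lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


# Hypothetical quality scores for a 9-base read.
print(fragment_lengths([30, 30, 10, 25, 25, 25, 5, 5, 40]))  # [2, 3, 1]
```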
Supplementary Figure S6. Distribution of alignment length, alignment identity, and alignable fraction for 454 reads for trip A (A) and B (B). Q1, Q2, and Q3 = first, second (median), and third quartiles.
Supplementary Figure S7. Genus-level phylogenetic profile of class Gammaproteobacteria obtained by comparing trip A (A) and trip B (B) reads against NT and WGS databases.
Supplementary Figure S8. Example of using Galaxy to process alignment results generated within the CAMERA system. Using the “Export” drop-down of the CAMERA interface, we downloaded the results in BLAST XML format (A). These data were then uploaded into Galaxy and processed using its XML parser (B). Errors (red text in A) resulted from some reads being too short to be used for computing the Karlin-Altschul statistics used in megaBLAST.
Supplementary Figure S9. The precipitation and temperature data along the collection routes. Data are from the US Department of Agriculture web site.
Supplementary Table 1. Gamma-proteobacterial genera