Finding a Needle in a Haystack

-Abbott Laboratories/UCSF HPgV-2 Discovery Project 

Jason Zhang, Feb. 2016

Abstract--The Galaxy platform stands out for its configurability, adaptability, extensibility and reproducibility. By integrating numerous bioinformatics tools, Galaxy-based analysis has been applied across data-intensive biomedical research, including metagenomics and clinical diagnostics. In this report, I set up a Galaxy server on a Mac laptop and installed software for metagenomics analysis. Using a customized Galaxy workflow and publicly available data, metagenomics analysis of a human patient sample correctly identified the related pathogens. In addition, three metagenomics tools are tested and compared.


While next-generation sequencing (NGS) has changed the nature and scope of genomic research, it is also starting to show its potential in clinical diagnostics. The large volume of data generated at decreasing cost makes NGS an ideal platform for comprehensive mutation analysis (germline and somatic alterations) as well as metagenomic analysis, including the phylogeny and taxonomy of pathogens.

A recent study from Abbott Laboratories presented the discovery of a new pegivirus (HPgV-2) by next-generation sequencing of plasma from an HCV-infected patient. Initially, metagenomic next-generation sequencing and data analysis by UCSF/Abbott Laboratories identified three NGS reads out of ~10^7 paired-end reads from the plasma sample of an HCV-infected patient, which shared 60% amino acid identity with a known pegivirus (simian pegivirus A). Further experiments and analysis confirmed this virus as a new strain of HPgV. The genome of HPgV-2 was sequenced and de novo assembled using NGS and Sanger sequencing.

In this post, I use a Galaxy server installed on my MacBook Pro to try to interpret some of the data from the Abbott Laboratories HPgV-2 study. All the raw data are publicly available through the NCBI SRA database, and all the tools are open-source software. My first goal is to perform taxonomic analysis on the metagenomics data of the HPgV-2 infected patient, and to find the tool that handles this task most efficiently in Galaxy. I tested three widely used metagenomics tools (MegaBLAST, Kraken and MetaPhlAn), compared their performance and discuss the differences in their output.

Metagenomics software

MegaBLAST is the best-known, highly sensitive alignment algorithm and one of the best methods for assigning a taxonomic label to an unknown sequence. MegaBLAST classifies a sequence by finding the best alignment to a large database of genomic sequences. Kraken takes advantage of a pre-built database whose records each consist of a k-mer and the Lowest Common Ancestor (LCA) of all organisms whose genomes contain that k-mer, and matches each k-mer from the query sequence against this index. MetaPhlAn uses a database that is much smaller than the collection of all genomes, which allows it to perform classification much faster than methods that attempt to identify every read in a data set. The database is engineered to contain 'marker' genes that have been found to be specific to certain clades. It is therefore meant to characterize the distribution of organisms present in a given sample rather than to label every single read in that sample.
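The k-mer/LCA idea behind Kraken can be sketched in a few lines of Python. This is a toy illustration only, not Kraken's actual implementation: the taxonomy, the taxon names (family_X, virus_A, virus_B), the two short "genomes" and k = 5 are all made up for the example, and the final step simply counts k-mer hits per taxon rather than reproducing Kraken's scoring.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); names are invented for illustration.
PARENT = {"root": None, "family_X": "root", "virus_A": "family_X", "virus_B": "family_X"}


def lineage(taxon):
    """Return the path from a taxon up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    raise ValueError("taxa do not share a root")


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(genomes, k):
    """Map each k-mer to the LCA of all taxa whose genome contains it."""
    index = {}
    for taxon, seq in genomes.items():
        for kmer in set(kmers(seq, k)):
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


def kmer_hits(read, index, k):
    """Count, per taxon, how many k-mers of the read are found in the index."""
    return Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)


# Made-up genomes that share their first six bases.
GENOMES = {"virus_A": "ACGTACGGTT", "virus_B": "ACGTACCCAA"}
INDEX = build_index(GENOMES, k=5)

if __name__ == "__main__":
    print(kmer_hits("GTACGGT", INDEX, k=5))  # k-mers unique to virus_A
    print(kmer_hits("ACGTAC", INDEX, k=5))   # k-mers shared by both genomes
```

A read drawn from the shared region can only be placed at the common ancestor, while a read covering genome-specific k-mers is placed at the species level.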

NGS data retrieval

Metagenomic NGS data corresponding to plasma samples from chronic liver disease patients have been submitted to the NCBI Sequence Read Archive with accession number SRP066211. NGS reads were filtered for exclusion of human sequences by Bowtie2 high-sensitivity local alignment to the human hg38 reference database. Data from patient case UC0125.US was downloaded and used for subsequent analysis.
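The human-read exclusion step can be illustrated with a small Python sketch. It assumes the Bowtie2 alignments against hg38 are available as SAM text, and keeps only reads that did not align to the human reference, using the standard SAM flag bits (0x1 paired, 0x4 read unmapped, 0x8 mate unmapped). The function name and the example records are hypothetical, and this is not necessarily how the original study implemented the filter.

```python
PAIRED = 0x1         # SAM flag bit: read is part of a pair
UNMAPPED = 0x4       # SAM flag bit: this read did not align
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate did not align


def non_human_read_names(sam_lines):
    """Return names of reads that did not align to the human reference.

    For paired reads, both mates must be unaligned for the pair to be kept.
    """
    keep = set()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if not flag & UNMAPPED:
            continue  # this read aligned to human, drop it
        if flag & PAIRED and not flag & MATE_UNMAPPED:
            continue  # the mate aligned to human, drop the pair
        keep.add(name)
    return keep


# Made-up SAM records (only the first two columns matter for this sketch).
EXAMPLE_SAM = [
    "@HD\tVN:1.0",
    "read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",      # pair, both mates unaligned
    "read1\t141\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
    "read2\t99\tchr1\t100\t42\t4M\t=\t150\t54\tACGT\tIIII",  # aligned pair
    "read3\t133\t*\t0\t0\t*\t*\t0\t0\tGGCA\tIIII",     # unaligned, but mate aligned
]

if __name__ == "__main__":
    print(non_human_read_names(EXAMPLE_SAM))
```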

Galaxy workflow

An example Galaxy workflow for metagenomic analysis, conducted on a Galaxy server, is shown below. It comes from a very interesting study on the metagenomic analysis of samples collected by the windshield of a moving vehicle (check the original publication here).

I used a similar Galaxy workflow for the NGS data pre-processing. To analyze the HPgV-2 sample, I used a Galaxy server on my own laptop because the tools needed (MegaBLAST, Kraken and MetaPhlAn) are currently not available on the public Galaxy web server. For the same reason, the workflow cannot be published to the public Galaxy server, so I summarize the results from my Galaxy workflow in the following sections.

Result: Platform comparison:

Comparison of three major metagenomics tools

                          MegaBLAST        Kraken           MetaPhlAn
Number of query reads     450,000          450,000          7,745,500
Database size             36 GB            4.2 GB           6 MB
Run time (hh:mm:ss)       34:55:00         00:01:00         00:00:36
Number of reads mapped    220,000 (49%)    160,000 (36%)    2,047 (<0.3%)
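The derived figures in this comparison (percentage of reads mapped, and how run times compare) can be checked with a short Python sketch. The read counts and run times are copied from the comparison above; the dictionary layout and function names are my own choices for illustration.

```python
def to_seconds(hms):
    """Convert an hh:mm:ss string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values taken from the platform comparison.
RUNS = {
    "MegaBLAST": {"query": 450_000, "mapped": 220_000, "time": "34:55:00"},
    "Kraken": {"query": 450_000, "mapped": 160_000, "time": "00:01:00"},
    "MetaPhlAn": {"query": 7_745_500, "mapped": 2_047, "time": "00:00:36"},
}


def summarize(runs):
    """Compute percent mapped, run time in seconds and throughput per tool."""
    summary = {}
    for tool, run in runs.items():
        seconds = to_seconds(run["time"])
        summary[tool] = {
            "percent_mapped": 100 * run["mapped"] / run["query"],
            "seconds": seconds,
            "reads_per_second": run["query"] / seconds,
        }
    return summary


if __name__ == "__main__":
    for tool, stats in summarize(RUNS).items():
        print(f"{tool}: {stats['percent_mapped']:.2f}% mapped, "
              f"{stats['seconds']} s, {stats['reads_per_second']:.0f} reads/s")
```

On these numbers the MegaBLAST run (125,700 seconds) is roughly 2,000 times longer than the Kraken run (60 seconds) on the same 450,000 reads.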

Result: Phylogeny of pathogens from HCV-infected patient 

---MegaBLAST used the most comprehensive database, the NCBI nt database, which includes the most up-to-date genome sequences for all organisms (recent update: Feb-26-2016). As a result, MegaBLAST achieved the highest sensitivity in read mapping (220,000 of 450,000 reads mapped). The remaining ~50% of reads were unassigned and presumably represent unknown species in the sample. This ratio of known to unknown species is in line with at least two environmental metagenomics studies: City-scale metagenomics and Metagenomics of Airborne microbial communities.

A pie chart of the MegaBLAST result is shown in the link below. As the plot shows, the major pathogens in the patient sample include Hepatitis C virus as well as a co-infecting Torque teno virus (TTV, also known as Transfusion Transmitted Virus). TTV is often found in patients with liver disease and is highly associated with HCV and HBV infection. Surprisingly, the new HPgV-2 virus discovered in this study was also captured by MegaBLAST. I checked the date on which the whole genome of HPgV-2 was released to the NCBI nt library and realized that it was just before I obtained my nt database in Feb. 2016, so MegaBLAST did a very good job of capturing this new species in the sample! However, the abundance of HPgV-2 reads appears to be extremely low, with only 8 of 450,000 query reads (0.002%). This result demonstrates the high sensitivity of MegaBLAST. On the other hand, MegaBLAST achieves this sensitivity by sacrificing speed. The BLAST run took about 1.5 days on the MacBook Pro, while the other programs took only a minute or even less.
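How per-read BLAST hits turn into the counts behind such a pie chart can be sketched in Python. The sketch assumes BLAST tabular output with the standard 12 columns (query ID in column 1, subject ID in column 2, bit score in column 12), keeps the highest-scoring hit per read, and tallies reads per taxon. The subject IDs, the subject-to-taxon lookup and the example rows are hypothetical; the Galaxy workflow used for this post may do this differently.

```python
import csv
import io
from collections import Counter


def best_hits(tabular_text):
    """Keep the highest bit-score subject for each query read."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row:
            continue
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def tally_taxa(best, subject_to_taxon):
    """Count reads per taxon; subjects missing from the lookup are 'unclassified'."""
    return Counter(subject_to_taxon.get(subject, "unclassified")
                   for subject in best.values())


def abundance_percent(count, total_reads):
    """Share of all query reads assigned to a taxon, in percent."""
    return 100 * count / total_reads


# Made-up example rows and a made-up subject -> taxon lookup.
EXAMPLE = (
    "read1\tsubjA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-40\t170\n"
    "read1\tsubjB\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t120\n"
    "read2\tsubjB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t150\n"
    "read3\tsubjC\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-45\t180\n"
)
TAXA = {"subjA": "Hepatitis C virus", "subjB": "Torque teno virus"}

if __name__ == "__main__":
    print(tally_taxa(best_hits(EXAMPLE), TAXA))
    print(abundance_percent(8, 450_000))  # the HPgV-2 case: 8 of 450,000 reads
```

For the HPgV-2 case, 8 of 450,000 reads works out to about 0.0018%, which rounds to the 0.002% quoted above.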

 Click here to view the interactive phylogenetic plot from MegaBLAST

---The Kraken database (MiniKraken) includes only bacterial, archaeal and viral genomes from RefSeq, so the ~1% of eukaryotic species shown by MegaBLAST is missed by Kraken. However, Kraken delivered a very good result on the speed test, assigning the same dataset used for MegaBLAST in about a minute. The ratio of mapped reads (~36%) is lower than that of MegaBLAST, which can be explained by differences in both algorithm and database. Kraken classification is based on exact k-mer matches, while MegaBLAST calculates and ranks alignments based on sequence homology. The MiniKraken database is a reduced Kraken k-mer database optimized for better performance on a laptop (MacBook), while the nt library is the most complete and up-to-date library.
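The effect of exact k-mer matching can be shown with a small Python sketch: a single substitution leaves a read 29/30 identical to the reference, which an alignment-based method tolerates easily, yet every k-mer overlapping the changed base no longer matches exactly. The reference sequence, the read and k = 8 are arbitrary choices for illustration (real Kraken databases use a much larger k), and the function names are my own.

```python
def kmer_set(seq, k):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(read, reference, k):
    """Fraction of the read's k-mers found exactly in the reference."""
    reference_kmers = kmer_set(reference, k)
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    return sum(kmer in reference_kmers for kmer in read_kmers) / len(read_kmers)


def identity(a, b):
    """Per-base identity of two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Arbitrary 40-base reference and a 30-base read taken from it.
REFERENCE = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCATTAGCA"
READ = REFERENCE[5:35]
# Introduce one substitution in the middle of the read.
MUTATED = READ[:15] + ("A" if READ[15] != "A" else "C") + READ[16:]

if __name__ == "__main__":
    print("identity:", identity(READ, MUTATED))
    print("exact read:", shared_kmer_fraction(READ, REFERENCE, 8))
    print("mutated read:", shared_kmer_fraction(MUTATED, REFERENCE, 8))
```

A divergent virus such as HPgV-2, which shares only partial identity with known pegiviruses, is therefore much easier to miss with exact k-mer lookup than with alignment.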

 Click here to view the interactive phylogenetic plot from Kraken

---MetaPhlAn takes advantage of a small database that contains only the 'marker' genes for certain genomes. Identifying those genome-specific marker genes is not an easy task, so the initial version of the MetaPhlAn database covers ONLY bacterial genomes. Since the sample is not enriched for bacteria, mapping reads from other species (mainly viruses) onto bacterial genomes is not a good approach at all. Reads from viruses and other species must be subtracted first in order to take advantage of the MetaPhlAn bacterial database (I will cover this in another post). As a result, MetaPhlAn does not identify viruses, and the bacterial taxonomy it reports for the HCV patient may not be reliable in the current MetaPhlAn workflow because the sample is dominated by viral reads.

 Click here to view the interactive phylogenetic plot from MetaPhlAn

---A new version, MetaPhlAn2.0, provides a database that covers viral and other microbial genomes, which matches the needs of patient dataset analysis more closely. I ran this new version on the author's Galaxy server and got much better read mapping and a much better taxonomy plot. However, MetaPhlAn2.0 showed a much smaller percentage of bacterial species in the phylogeny plot compared with the MegaBLAST and Kraken results. This might be caused by the different algorithm MetaPhlAn uses to characterize reads and count species.

 Click here to view the interactive phylogenetic plot from MetaPhlAn2.0