
CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences

Gregory Minevich 1,§, Danny S. Park 1, Daniel Blankenberg 2, Richard J. Poole 1,3,§, and Oliver Hobert 1,§

1 Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, NY, USA

2 Center for Comparative Genomics and Bioinformatics, Penn State University, University Park, PA, USA

3 Present address: Department of Cell & Developmental Biology, University College London, London WC1E 6B

§Correspondence to gm2123@columbia.edu (G.M.), r.poole@ucl.ac.uk (R.J.P.) or or38@columbia.edu (O.H.)

Please also see www.hobertlab.org/cloudmap for FAQs

Abstract

Whole genome sequencing (WGS) allows researchers to pinpoint genetic differences between individuals and significantly shortcuts the costly and time-consuming part of forward genetic analysis in model organism systems. Currently, the most effort-intensive part of WGS is the bioinformatic analysis of the relatively short reads generated by second generation sequencing platforms. We describe here a novel, easily accessible and cloud-based pipeline, called CloudMap, which greatly simplifies the analysis of mutant genome sequences. Available on the Galaxy web platform, CloudMap requires no software installation when run on the cloud, but it can also be run locally or via Amazon’s Elastic Compute Cloud (EC2) service. CloudMap uses a series of pre-defined workflows to pinpoint sequence variations in animal genomes, such as those of pre-mutagenized and mutagenized Caenorhabditis elegans strains. In combination with a variant-based mapping procedure, CloudMap allows users to sharply define genetic map intervals graphically and to retrieve very short lists of candidate variants with a few simple clicks. Automated workflows and extensive video user guides are available to detail the individual analysis steps performed (http://usegalaxy.org/cloudmap). We demonstrate the utility of CloudMap for WGS analysis of C. elegans and Arabidopsis genomes and describe how other organisms (e.g. Zebrafish, Drosophila) can easily be accommodated by this software platform. To accommodate rapid analysis of many mutants from large scale genetic screens, CloudMap contains an in silico complementation testing tool which allows users to rapidly identify instances where multiple alleles of the same gene are present in the mutant collection. Lastly, we describe the application of a novel mapping/WGS method (“Variant Discovery Mapping”) that does not rely on a defined polymorphic mapping strain and we integrate the application of this method into CloudMap. 
CloudMap tools and documentation are continually updated at http://usegalaxy.org/cloudmap

Hawaiian Variant Mapping

Map a mutation by plotting recombination frequencies resulting from crossing to a highly polymorphic strain.

This tool improves upon, and automates, the method described in Doitsidou et al., PLoS One 2010 for mapping causal mutations using whole genome sequencing data.

The polymorphic Hawaiian strain CB4856 is used as a mapping strain in most cases, but in principle any sequenced nematode strain that is significantly different from the mutant strain can be used for mapping. The tool plots the ratio of mapping strain (Hawaiian) reads to total reads at all SNP positions, reflecting the number of recombinants in the sequenced pool of animals. Chromosomes that contain regions of linkage to the causal mutation will have regions where this ratio is equal to 0. The scatter plots for such linked regions will have a high number of data points lying exactly on the X axis. A loess regression line is plotted through all the points on a given chromosome, which helps to define the linked region more accurately.
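The ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CloudMap implementation: the SnpCount record and both function names are hypothetical, and read counts are assumed to have been tallied already for each SNP position.

```python
from typing import List, NamedTuple, Optional, Tuple


class SnpCount(NamedTuple):
    """Read counts at one known SNP position (hypothetical record layout)."""
    chrom: str
    pos: int
    mutant_reads: int    # reads carrying the mutant strain (N2) allele
    mapping_reads: int   # reads carrying the mapping strain (Hawaiian) allele


def mapping_ratio(snp: SnpCount) -> Optional[float]:
    """Return mapping strain reads / total reads, or None if uncovered."""
    total = snp.mutant_reads + snp.mapping_reads
    if total == 0:
        return None
    return snp.mapping_reads / total


def zero_ratio_positions(snps: List[SnpCount]) -> List[Tuple[str, int]]:
    """Positions where only mutant strain alleles were seen (ratio of 0)."""
    return [(s.chrom, s.pos) for s in snps if mapping_ratio(s) == 0]
```

Positions returned by zero_ratio_positions correspond to the points that lie exactly on the X axis of the scatter plot. Positions with no coverage are skipped rather than counted as 0.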

Each scatter plot has a corresponding frequency plot that displays regions of linked chromosomes where pure parental (mutant strain) alleles are concentrated. 1 Mb bins for the 0 ratio SNP positions are colored gray by default and 0.5 Mb bins are colored in red. By default, frequency plots of pure parental alleles are normalized to remove false linkage caused by previously described (Seidel et al. 2008) patterns of genetic incompatibility between Bristol and Hawaiian strains. This normalization can be turned off via a checkbox in the tool's input form.
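As a rough sketch of the binning behind the frequency plots, the following Python counts positions per fixed-size bin. The function name is hypothetical, and the normalization step mentioned above is not shown because its details are not given here.

```python
from collections import Counter
from typing import Dict, Iterable


def bin_counts(positions: Iterable[int], bin_size: int) -> Dict[int, int]:
    """Count positions per bin; bin i covers [i * bin_size, (i + 1) * bin_size)."""
    counts = Counter(pos // bin_size for pos in positions)
    return dict(sorted(counts.items()))


# Example: 0 ratio SNP positions on one chromosome, in base pairs.
zero_ratio = [100_000, 400_000, 600_000, 1_200_000]
one_mb_bins = bin_counts(zero_ratio, 1_000_000)   # the gray bins
half_mb_bins = bin_counts(zero_ratio, 500_000)    # the red bins
```

Bins with no positions are simply absent from the returned dictionary.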

The experimental design required to generate data for the plots is described in the CloudMap paper (Fig.6A). A representative linked chromosome is shown in Fig.6B.

Variant Discovery Mapping

Map a mutation using in silico bulk segregant linkage analysis using variants that are already present in the mutant strain of interest (rather than those introduced by a cross to a polymorphic strain).

Although Hawaiian Variant Mapping is the preferred method for mapping causal mutations in whole genome sequenced strains (see CloudMap Hawaiian Variant Mapping with WGS tool), there remain certain scenarios where alternate mapping approaches are useful. For instance, introducing tens of thousands of Hawaiian variants into a mutant strain may not be desirable for individuals concerned with the possibility that some of these Hawaiian variants may act as modifiers of a given phenotype. Behavioral mutants may be especially vulnerable in this regard. Furthermore, in the case of suppressor screens or other screens that have been performed in a mutant background, it is tedious to recover both the suppressor variant and the starting mutation when picking the F2 progeny required for the Hawaiian Variant Mapping technique. In these scenarios, it is useful to not have to rely on a polymorphic mapping strain like the Hawaiian strain.

A recent study in plants (Abe et al. 2012) uses EMS-induced variants and bulk segregant analysis to map a phenotype-causing mutation. We have developed a similar method, which we call “Variant Discovery Mapping”. Our method makes use of background variants in addition to EMS-induced variants (including indels as well as SNPs), and also uses the bulk segregant approach.

The conceptual strategy of variant discovery mapping is to perform in silico bulk segregant linkage analysis using variants that are already present in the mutant strain of interest, rather than examining those introduced by a cross to a polymorphic strain. Any individual mutant strain will contain a certain number of homozygous variants compared to the reference genome. These homozygous variants are of two types: 1) those directly induced during mutagenesis (one or more of which are responsible for the mutant phenotype) (Fig.11A red diamonds) and 2) those already present in the background of the parental strain, either because of genetic drift or because of the parental strain containing, for example, a transgene that was integrated into the genome by irradiation (Fig.11A pale blue diamonds).

Following an outcross to a non-parental strain and selection of a pool of F2 mutant recombinants, these homozygous variants will segregate according to their degree of linkage to the phenotype-inducing locus. The degree of linkage will be directly reflected in the allele frequency among the pool of recombinants, and this can be represented as scatter plots of the ratio of variant reads/total reads present in the pool of sequenced recombinants (Fig.11A). We then plot a loess regression line through all the points on a given chromosome to give greater accuracy to the mapping region (Fig.11B). The loess lines on scatter plots for linked chromosomes approach 1, indicating retention of the original homozygous variants in the linked region. We also draw corresponding frequency plots that display regions of linked chromosomes where pure parental allele variant positions are concentrated (positions where the ratio of variant reads/total reads is equal to 1) (Fig.11B). 1 Mb bins for these ratio 1 variant positions are colored gray by default and 0.5 Mb bins are colored in red.
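For illustration, the ratio of variant reads/total reads can be derived from a single-sample VCF record that carries an AD (allelic depth) entry in its genotype column, as GATK-style VCFs typically do. The sketch below is an assumption about how one might compute it, not the CloudMap code. The function name is hypothetical, and records with a missing AD value or more than one alternate allele are not handled.

```python
from typing import Optional, Tuple


def variant_ratio_from_vcf_line(line: str) -> Tuple[str, int, Optional[float]]:
    """Return (chrom, pos, variant reads / total reads) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    # Column 9 names the per-sample keys; column 10 holds the sample's values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    ref_reads, alt_reads = (int(x) for x in sample["AD"].split(",")[:2])
    total = ref_reads + alt_reads
    ratio = alt_reads / total if total else None
    return chrom, pos, ratio


# A made-up record in which every read supports the variant allele.
example = "III\t12345\t.\tG\tA\t50\tPASS\t.\tGT:AD:DP\t1/1:0,20:20"
chrom, pos, ratio = variant_ratio_from_vcf_line(example)
```

A ratio of 1.0, as in this example, marks a position that would be counted in the frequency plot of pure parental allele positions.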

EMS Density Mapping

Map a mutation by linkage to regions of high mutation density using WGS data.

Following the approach detailed in Zuryn et al., Genetics 2010, this tool plots histograms of variant density in a mutant C. elegans strain that has been backcrossed to its (pre-mutagenesis) starting strain. Common (i.e. non-phenotype causing) variants present in multiple WGS strains with the same background should first be subtracted using the GATK tool Select Variants.

In the sample output for this tool, LG III shows linkage to the causal mutation. In this example, common variants from another strain have been subtracted and the remaining variants have been filtered for the most common EMS-induced mutations (i.e. G/C --> A/T).

The experimental approach is detailed in Figure 1a from Zuryn et al., Genetics 2010.

Subtracting common (non-phenotype causing) variants from more whole genome sequenced strains (using GATK Tools Select Variants) will result in less noise and a tighter mapping region. Additional backcrosses will also result in a smaller mapping region.
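The filtering and counting steps described in this section can be sketched as follows. This is a simplified stand-in, not the CloudMap tool or GATK Select Variants: the function name and the (chromosome, position, reference base, variant base) tuple layout are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref base, variant base)

# Most common EMS-induced changes (G/C --> A/T), as (ref, variant) pairs.
EMS_CHANGES = {("G", "A"), ("C", "T")}


def ems_density(variants: Iterable[Variant],
                common: Iterable[Variant],
                bin_size: int = 1_000_000) -> Counter:
    """Count canonical EMS variants per (chrom, bin) after subtracting common ones."""
    common_set = set(common)
    counts: Counter = Counter()
    for chrom, pos, ref, alt in variants:
        if (chrom, pos, ref, alt) in common_set:
            continue  # shared with another strain, so not phenotype-causing
        if (ref.upper(), alt.upper()) not in EMS_CHANGES:
            continue  # not a typical EMS-induced change
        counts[(chrom, pos // bin_size)] += 1
    return counts
```

Bins with high counts in the result point to the region of the genome that remained linked to the causal mutation through the backcrosses.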

CloudMap Materials:

User guides

Video user guide demonstrating the Hawaiian Variant Mapping workflow using the ot266 proof of principle dataset from the CloudMap paper: 

https://vimeo.com/51082571

Note: In the interest of allowing users to quickly run a Hawaiian Variant Mapping example, the ot266 FASTQ sample dataset is a small subset of all the ot266 reads. For this reason, plots and variant lists generated by the example will not exactly match the ot266 figures in the CloudMap paper.

Video user guides demonstrating all workflows:

See www.hobertlab.org/cloudmap

PDF user guide:

Dataset 'CloudMap_Userguide_11-28-2012_large.pdf'

Dataset 'CloudMap_Userguide_11-28-2012_small.pdf'

Workflows

Hawaiian Variant Mapping workflow using the ot266 proof of principle dataset from the CloudMap paper (workflow can be used for any strain that has been crossed to a mapping strain e.g. Hawaiian):

Workflow: CloudMap Hawaiian and Variant Discovery Mapping on Hawaiian Mapped samples (includes variant calling)_2-7-2014

Variant Discovery Mapping (VDM) workflow:

The VDM method works by using the variants in the mutant strain for mapping (EMS-induced variants and variants caused by genetic drift relative to N2 prior to EMS mutagenesis of the strain), not the variants in the crossing strain. If you cross your mutant to N2, you would ideally sequence your N2 and subtract only the variants in your N2 (relative to the published reference). This allows you to use both classes of mutations for mapping (EMS and genetic drift variants). If you haven't sequenced your N2 crossing strain, you can deduce the crossing strain mutations by selecting mutations that are common to several strains (of a different genetic background) that have been crossed to the N2 crossing strain. Sequencing the N2 is better, but this alternative method has worked for us. The key point is that if you create your list of crossing strain variants by pooling strains of the same background, you will potentially subtract the genetic drift mutations in your starting strain and thus lose variants that could be used for mapping. In the CloudMap paper, the middle panel of Fig.11B shows the mapping with just crossing strain variants subtracted and the bottom panel shows mapping with both crossing strain and background mutations subtracted. You can see that subtracting only crossing strain variants works best.
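The alternative described above, deducing crossing strain variants from several strains and then subtracting them, amounts to a set intersection followed by a set difference. A minimal sketch, assuming variants are represented as (chromosome, position, reference base, variant base) tuples and using hypothetical function names:

```python
from typing import Iterable, List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref base, variant base)


def infer_crossing_variants(strains: Iterable[Iterable[Variant]]) -> Set[Variant]:
    """Variants present in every strain that was crossed to the same crossing strain."""
    sets = [set(s) for s in strains]
    if not sets:
        return set()
    return set.intersection(*sets)


def subtract_variants(mutant: Iterable[Variant],
                      crossing: Iterable[Variant]) -> List[Variant]:
    """Keep only mutant variants that are absent from the crossing strain list."""
    crossing_set = set(crossing)
    return [v for v in mutant if v not in crossing_set]
```

If the pooled strains share the background of your mutant, the intersection will also contain that background's genetic drift variants, which is the loss of mapping variants that the paragraph above warns about.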

A VDM workflow that starts from a raw FASTQ file:

Workflow: CloudMap Variant Discovery Mapping (includes variant calling)_2-7-2014

Input files for this workflow:

1) Raw FASTQ file

2) Variants in the crossing strain (e.g. variants in your N2 strain that you used to backcross) in VCF format. (You can generate this list by running the CloudMap Unmapped Mutant workflow on your crossing strain and taking the resulting homozygous variants output file)

3) Fasta reference file available in the CloudMap data library

4) List of candidate genes for annotation (e.g. transcription factors). NOTE: All variants will be annotated regardless of what candidate gene list is used. The genes in the candidate gene list will simply be marked in the last column of the annotated output file for filtering purposes.
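The role of the candidate gene list in input 4 can be illustrated with a short sketch: every annotated row is kept, and a marker is appended as the last column when the row's gene appears in the list. The function name, the row layout, the gene column index and the marker text are all hypothetical. The actual output format of the workflow may differ.

```python
from typing import Iterable, List, Sequence


def mark_candidates(rows: Iterable[Sequence[str]],
                    candidate_genes: Iterable[str],
                    gene_col: int) -> List[List[str]]:
    """Append a last column flagging rows whose gene is in the candidate list."""
    candidates = {gene.strip() for gene in candidate_genes}
    marked = []
    for row in rows:
        flag = "candidate" if row[gene_col] in candidates else ""
        marked.append(list(row) + [flag])  # all rows are kept
    return marked


# Made-up annotated rows: chromosome, position, gene name.
rows = [["III", "12345", "gene-1"], ["X", "678", "gene-2"]]
marked = mark_candidates(rows, ["gene-1\n"], gene_col=2)
```

Filtering on the appended column then gives the subset of variants that fall in candidate genes, without discarding any of the other annotated variants.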

Shared history example:

History: CloudMap ot642HA_Variant Discovery Mapping 2-7-2014

A VDM workflow that starts from VCF files generated by the CloudMap unmapped mutant workflow:

Input files for this workflow: 

1) Homozygous and heterozygous variants as called by the GATK Unified Genotyper in either the unmapped mutant workflow or Hawaiian Mapped workflow (if mutant was Hawaiian mapped).

2) List of crossing strain variants. Depending on the cross performed, these can be the variants in your sequenced N2 relative to the published N2, the list of Hawaiian unfiltered variants available in the CloudMap Shared Data Library on Galaxy, or a list of variants from some other crossing strain that you have sequenced.  

3) The WS220 fasta reference file available in the CloudMap Shared Data Library on Galaxy.

CloudMap Variant Discovery Mapping (Subtracts Crossing Strain from list of Homozygous and Heterozygous Variants called by GATK Unified Genotyper default settings) 12-22-2013

Sample history using the VDM workflow:

History 'CloudMap_Variant_Discovery_Mapping_history_12-21-2013'

EMS Variant Density Mapping workflow (takes VCF of heterozygous and homozygous background variants to subtract):

Workflow 'CloudMap EMS Variant Density Mapping workflow (takes VCF of heterozygous and homozygous variants to subtract)'

Note: The Unmapped mutant workflow can be used to generate the VCF of heterozygous and homozygous variants to subtract from the primary sample to be mapped using EMS variant density.

EMS Variant Density Mapping workflow (takes FASTQ reads from a second sample, creates VCF, and subtracts those variants from the primary sample): 

Coming soon...

Unmapped mutant workflow (no variants from other strains to subtract): 

Workflow 'CloudMap Unmapped Mutant workflow'

Unmapped mutant workflow (allows for subtraction of variants from other strains):

Workflow 'CloudMap Unmapped Mutant workflow (w/ subtraction of other strains)'

Uncovered Region Subtraction workflow (allows for subtraction of uncovered regions from other strains):

Workflow 'Cloudmap Uncovered Region Subtraction workflow'

Subtract variants workflow (1 set of candidates, 2 sets of variants to subtract) 

Workflow 'CloudMap Subtract Variants workflow (1 set candidates, 2 sets of variants to subtract)'

Shared Histories

Shared history from the ot266 proof of principle dataset from the CloudMap paper (all the files generated from the workflow above):

History 'CloudMap_ot266_Proof_of_Principle (with hidden data)'

History 'CloudMap_ot266_Proof_of_Principle (with unhidden data)'

CloudMap Tools & Data

CloudMap tools can be downloaded from the Galaxy toolshed. They can be run in a local Galaxy install, as standalone Python scripts on a computer that has Python and R installed, or in Galaxy on Amazon's Elastic Compute Cloud (EC2) service:

http://toolshed.g2.bx.psu.edu/

Shared data library (contains the use case examples from the paper and user guide as well as key reference files):

Go to http://usegalaxy.org/library and search for CloudMap

Alternative ways to run Galaxy & CloudMap

Galaxy and CloudMap on Amazon's Elastic Compute Cloud (EC2):

http://wiki.g2.bx.psu.edu/CloudMan

Running Galaxy locally:

http://wiki.g2.bx.psu.edu/Admin/Get%20Galaxy