Gregory Minevich 1,§, Danny S. Park 1, Daniel Blankenberg 2, Richard J. Poole 1,3,§, and Oliver Hobert 1,§
1 Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, NY, USA
2 Center for Comparative Genomics and Bioinformatics, Penn State University, University Park, PA, USA
3 Present address: Department of Cell & Developmental Biology, University College London, London WC1E 6B
§Correspondence to email@example.com (G.M.), firstname.lastname@example.org (R.J.P.) or email@example.com (O.H.)
Whole genome sequencing (WGS) allows researchers to pinpoint genetic differences between individuals and significantly shortcuts the costly and time-consuming part of forward genetic analysis in model organism systems. Currently, the most effort-intensive part of WGS is the bioinformatic analysis of the relatively short reads generated by second generation sequencing platforms. We describe here a novel, easily accessible and cloud-based pipeline, called CloudMap, which greatly simplifies the analysis of mutant genome sequences. Available on the Galaxy web platform, CloudMap requires no software installation when run on the cloud, but it can also be run locally or via Amazon’s Elastic Compute Cloud (EC2) service. CloudMap uses a series of pre-defined workflows to pinpoint sequence variations in animal genomes, such as those of pre-mutagenized and mutagenized Caenorhabditis elegans strains. In combination with a variant-based mapping procedure, CloudMap allows users to sharply define genetic map intervals graphically and to retrieve very short lists of candidate variants with a few simple clicks. Automated workflows and extensive video user guides are available to detail the individual analysis steps performed (http://usegalaxy.org/cloudmap). We demonstrate the utility of CloudMap for WGS analysis of C. elegans and Arabidopsis genomes and describe how other organisms (e.g. Zebrafish, Drosophila) can easily be accommodated by this software platform. To accommodate rapid analysis of many mutants from large scale genetic screens, CloudMap contains an in silico complementation testing tool which allows users to rapidly identify instances where multiple alleles of the same gene are present in the mutant collection. Lastly, we describe the application of a novel mapping/WGS method (“Variant Discovery Mapping”) that does not rely on a defined polymorphic mapping strain and we integrate the application of this method into CloudMap. 
CloudMap tools and documentation are continually updated at http://usegalaxy.org/cloudmap.
This tool improves upon, and automates, the method described in Doitsidou et al., PLoS One 2010 for mapping causal mutations using whole genome sequencing data.
The polymorphic Hawaiian strain CB4856 is used as a mapping strain in most cases, but in principle any sequenced nematode strain that is significantly different from the mutant strain can be used for mapping. The tool plots the ratio of mapping strain (Hawaiian) reads to total reads at all SNP positions, reflecting the number of recombinants in the sequenced pool of animals. Chromosomes linked to the causal mutation will contain regions where the ratio of mapping strain (Hawaiian)/total reads is equal to 0. The scatter plots for such linked regions will have a high number of data points lying exactly on the X axis. A loess regression line is plotted through all the points on a given chromosome to delineate the linked region more precisely.
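The ratio described above can be illustrated with a minimal Python sketch. It is not the CloudMap implementation: the `SnpCount` record and the function names are hypothetical, and the sketch assumes that per-SNP read counts for the Hawaiian and N2 alleles have already been extracted. The loess fit is not reproduced here.

```python
from typing import NamedTuple, Optional


class SnpCount(NamedTuple):
    """Hypothetical record of allele-specific read counts at one SNP position."""
    chrom: str
    pos: int
    hawaiian_reads: int  # reads supporting the mapping-strain (Hawaiian) allele
    n2_reads: int        # reads supporting the mutant-strain (N2) allele


def hawaiian_ratio(snp: SnpCount) -> Optional[float]:
    """Return Hawaiian reads / total reads, or None when there is no coverage."""
    total = snp.hawaiian_reads + snp.n2_reads
    if total == 0:
        return None
    return snp.hawaiian_reads / total


def ratios_by_chrom(snps):
    """Group (position, ratio) points by chromosome, sorted by position."""
    points = {}
    for snp in snps:
        ratio = hawaiian_ratio(snp)
        if ratio is not None:  # skip uncovered positions
            points.setdefault(snp.chrom, []).append((snp.pos, ratio))
    for chrom_points in points.values():
        chrom_points.sort()
    return points
```

Points with a ratio of exactly 0 are the ones that lie on the X axis of the scatter plot in a linked region.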
Each scatter plot has a corresponding frequency plot that displays regions of linked chromosomes where pure parental (mutant strain) alleles are concentrated. 1Mb bins for the 0 ratio SNP positions are colored gray by default and 0.5Mb bins are colored red. By default, frequency plots of pure parental alleles are normalized to remove false linkage caused by previously described patterns of genetic incompatibility between the Bristol and Hawaiian strains (Seidel et al. 2008). This normalization can be turned off via a checkbox in the input form.
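The binning behind the frequency plots can be sketched as follows. This is an illustrative Python example, not the tool's own code: the function name is hypothetical, the input is assumed to be a list of (position, ratio) pairs for one chromosome, and the normalization step is left out because its details are not given here.

```python
from collections import Counter


def pure_parental_bins(points, bin_size=1_000_000, target_ratio=0.0):
    """Count positions whose ratio equals target_ratio in fixed-width bins.

    points: iterable of (position, ratio) pairs for a single chromosome.
    Returns a dict mapping each bin start coordinate to its count.
    """
    counts = Counter()
    for pos, ratio in points:
        if ratio == target_ratio:
            bin_start = (pos // bin_size) * bin_size
            counts[bin_start] += 1
    return dict(sorted(counts.items()))
```

Calling the function once with `bin_size=1_000_000` and once with `bin_size=500_000` gives the two bin widths mentioned above.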
The experimental design required to generate data for the plots is described in the CloudMap paper (Fig.6A). A representative linked chromosome is shown in Fig.6B.
Although Hawaiian Variant Mapping is the preferred method for mapping causal mutations in whole genome sequenced strains (see CloudMap Hawaiian Variant Mapping with WGS tool), there remain certain scenarios where alternate mapping approaches are useful. For instance, introducing tens of thousands of Hawaiian variants into a mutant strain may not be desirable for individuals concerned with the possibility that some of these Hawaiian variants may act as modifiers of a given phenotype. Behavioral mutants may be especially vulnerable in this regard. Furthermore, in the case of suppressor screens or other screens that have been performed in a mutant background, it is tedious to recover both the suppressor variant and the starting mutation when picking the F2 progeny required for the Hawaiian Variant Mapping technique. In these scenarios, it is useful to not have to rely on a polymorphic mapping strain like the Hawaiian strain.
A recent study in plants (Abe et al. 2012) uses EMS-induced variants and bulk segregant analysis to map a phenotype-causing mutation. We have developed a similar method, which we call “Variant Discovery Mapping”. Our method makes use of background variants in addition to EMS-induced variants (including indels as well as SNPs), and also uses the bulk segregant approach.
The conceptual strategy of variant discovery mapping is to perform in silico bulk segregant linkage analysis using variants that are already present in the mutant strain of interest, rather than examining those introduced by a cross to a polymorphic strain. Any individual mutant strain will contain a certain number of homozygous variants compared to the reference genome. These homozygous variants are of two types: 1) those directly induced during mutagenesis (one or more of which are responsible for the mutant phenotype) (Fig.11A red diamonds) and 2) those already present in the background of the parental strain, either because of genetic drift or because of the parental strain containing, for example, a transgene that was integrated into the genome by irradiation (Fig.11A pale blue diamonds).
Following an outcross to a non-parental strain and selection of a pool of F2-mutant recombinants, these homozygous variants will segregate according to their degree of linkage to the phenotype-inducing locus. The degree of linkage is directly reflected in the allele frequency among the pool of recombinants, which can be represented as scatter plots of the ratio of variant reads/total reads present in the pool of sequenced recombinants (Fig.11A). We then plot a loess regression line through all the points on a given chromosome to define the mapping region more precisely (Fig.11B). The loess lines on scatter plots for linked chromosomes approach 1, indicating retention of the original homozygous variants in the linked region. We also draw corresponding frequency plots that display regions of linked chromosomes where pure parental allele variant positions are concentrated (positions where the ratio of variant reads/total reads is equal to 1) (Fig.11B). 1Mb bins for these positions (ratio equal to 1) are colored gray by default and 0.5Mb bins are colored red.
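As a rough illustration of the variant reads/total reads ratio, the Python sketch below reads it from a single-sample VCF record. It assumes the genotype column carries an `AD` (allelic depth) entry listing reference and alternate read counts, as GATK callers commonly emit. The function name is hypothetical and this is not the CloudMap code.

```python
def variant_ratio_from_vcf_line(line):
    """Return (chrom, pos, variant_reads / total_reads) for one VCF record.

    Returns None for header lines, records without a usable AD field,
    or positions with no coverage.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:  # need FORMAT and one sample column
        return None
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    ad = sample.get("AD")
    if ad is None:
        return None
    try:
        depths = [int(x) for x in ad.split(",")]
    except ValueError:  # e.g. "." for missing values
        return None
    total = sum(depths)
    if total == 0:
        return None
    variant_reads = sum(depths[1:])  # first entry is the reference allele
    return fields[0], int(fields[1]), variant_reads / total
```

Positions returning a ratio of 1 correspond to the pure parental allele positions counted in the frequency plots.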
Map a mutation by linkage to regions of high mutation density using WGS data.
Following the approach detailed in Zuryn et al., Genetics 2010, this tool plots histograms of variant density in a mutant C. elegans strain that has been backcrossed to its (pre-mutagenesis) starting strain. Common (i.e. non-phenotype causing) variants present in multiple WGS strains with the same background should first be subtracted using the GATK tool Select Variants.
Sample output where LG III shows linkage to the causal mutation is shown below. In this example, common variants from another strain have been subtracted and the remaining variants have been filtered for the most common EMS-induced mutations (i.e. G/C --> A/T):
The experimental approach is detailed in Figure 1a from Zuryn et al., Genetics 2010.
Subtracting common (non-phenotype causing) variants from more whole genome sequenced strains (using GATK Tools Select Variants) will result in less noise and a tighter mapping region. Additional backcrosses will also result in a smaller mapping region.
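The subtraction and EMS filtering steps amount to simple set logic, sketched here in plain Python for illustration. In the workflow itself the subtraction is done with GATK Select Variants, so this is a stand-in rather than the tool's method. The function names are hypothetical, and variants are assumed to be tab-separated VCF records matched on chromosome, position, REF and ALT.

```python
def variant_key(line):
    """Identify a variant by (chrom, pos, ref, alt) from a VCF record line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), fields[3], fields[4]


def subtract_variants(sample_lines, background_lines):
    """Keep sample records whose variant is absent from the background set."""
    background = {
        variant_key(line) for line in background_lines
        if not line.startswith("#")
    }
    return [
        line for line in sample_lines
        if not line.startswith("#") and variant_key(line) not in background
    ]


def ems_only(lines):
    """Keep the most common EMS-induced changes: G to A and C to T."""
    canonical = {("G", "A"), ("C", "T")}
    return [line for line in lines if variant_key(line)[2:] in canonical]
```

Passing the variants of several same-background strains as the background set mirrors the advice above: the more shared variants are removed, the less noise remains in the density histogram.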
The VDM method works by using the variants in the mutant strain for mapping (EMS-induced variants and variants caused by genetic drift relative to N2 prior to EMS mutagenesis of the strain), not the variants in the crossing strain. If you cross your mutant to N2, you would ideally sequence your N2 and subtract only the variants in your N2 (relative to the published reference). This allows you to use both classes of mutations for mapping (EMS and genetic drift variants). If you haven't sequenced your N2 crossing strain, you can deduce the crossing strain mutations by selecting mutations that are common to several strains (of a different genetic background) that have been crossed to the N2 crossing strain. Sequencing the N2 is better, but this alternative method has worked for us. The key point is that if you create your list of crossing strain variants by pooling strains of the same background, you will potentially be subtracting the genetic drift mutations in your starting strain and thus losing variants that could be used for mapping. In the CloudMap paper, the middle panel of Fig.11B shows the mapping with just crossing strain variants subtracted and the bottom panel shows the mapping with both crossing strain and background mutations subtracted. You can see that subtracting only crossing strain variants works best.
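The alternative of deducing crossing strain variants from several sequenced strains can be sketched as a count of how many strains share each variant. This Python example is only an illustration of that idea. The function name and the `min_strains` threshold are assumptions, and each strain is represented as a set of hashable variant identifiers such as (chrom, pos, ref, alt) tuples.

```python
from collections import Counter


def shared_variants(strain_variant_sets, min_strains):
    """Return variants present in at least min_strains of the given strains.

    strain_variant_sets: iterable of collections of hashable variant
    identifiers, one collection per sequenced strain.
    """
    counts = Counter()
    for variants in strain_variant_sets:
        counts.update(set(variants))  # count each variant once per strain
    return {variant for variant, n in counts.items() if n >= min_strains}
```

As noted above, the strains pooled this way should come from a different genetic background than the mutant being mapped, otherwise genetic drift variants useful for mapping may be subtracted as well.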
A VDM workflow that starts from a raw FASTQ file:
Input files for this workflow:
1) Raw FASTQ file
2) Variants in the crossing strain (e.g. variants in the N2 strain that you used to backcross) in VCF format. (You can generate this list by running the CloudMap Unmapped Mutant workflow on your crossing strain and taking the resulting homozygous variants output file.)
3) Fasta reference file available in the CloudMap data library
4) List of candidate genes for annotation (e.g. transcription factors). NOTE: All variants will be annotated regardless of which candidate gene list is used. Genes on the candidate gene list will simply be marked in the last column of the annotated output file for filtering purposes.
A VDM workflow that starts from VCF files generated by the CloudMap unmapped mutant workflow:
Input files for this workflow:
1) Homozygous and heterozygous variants as called by the GATK Unified Genotyper in either the unmapped mutant workflow or Hawaiian Mapped workflow (if mutant was Hawaiian mapped).
2) List of crossing strain variants. Depending on the cross performed, these can be the variants in your sequenced N2 relative to the published N2, the list of Hawaiian unfiltered variants available in the CloudMap Shared Data Library on Galaxy, or a list of variants from some other crossing strain that you have sequenced.
3) The WS220 fasta reference file available in the CloudMap Shared Data Library on Galaxy.
Sample history using the VDM workflow:
Note: The Unmapped Mutant workflow can be used to generate the VCF of heterozygous and homozygous variants to subtract from the primary sample to be mapped using EMS variant density.
CloudMap Tools & Data
Go to http://usegalaxy.org/library and search for CloudMap
Alternative ways to run Galaxy & CloudMap