High-throughput analysis of large and possibly unassembled genomes


Aakrosh Ratan, Oscar C. Bedoya-Reina, Richard Burhans, Robert S. Harris, Cathy Riemer, Yu Zhang, George H. Perry, Stephan C. Schuster, Webb Miller


It is currently difficult for a small group of investigators to effectively analyze their sequence data from a non-model organism or even, in some cases, to use such data generated by another group. Here we describe software systems for addressing this issue. Our focus is especially on understanding intra-species genetic diversity and its potential phenotypic and wildlife-conservation consequences. We generate low-coverage sequence data from multiple individuals, possibly in conjunction with deep coverage of a single individual. One set of tools prepares tables of genetic variants for the Galaxy web server, and the other tools run on Galaxy to analyze those data. Frequently, we avoid the creation of a de novo assembly of the genome of interest, and rely instead on an annotated “reference” genome of a related species. Here we illustrate use of the tools and evaluate their effectiveness using sequences generated by our group, as well as data published by other groups. We show that a strong signal of positive selection can be robustly identified even when the focus and reference lineages separated over 50 million years ago. Both sets of tools, those for creating the Galaxy tables and those for analyzing the tables via Galaxy, are freely available.

Data Sets

Many of the analyses reported in the paper were based on the ten data sets given here. (You can also find them under Shared Data -> Data Libraries -> Genome Diversity, then under chicken, stickleback and human.)

The first data set contains 7,285,024 putative chicken SNPs, each recorded in a row with 45 columns. These are described in the following paper and were provided by the authors.

Rubin CJ, Zody MC, Eriksson J, Meadows JR, Sherwood E, Webster MT, Jiang L, Ingman M, Sharpe T, Ka S, Hallbook F, Besnier F, Carlborg O, Bed'hom B, Tixier-Boichard M, Jensen P, Siegel P, Lindblad-Toh K, Andersson L. (2010) Whole-genome resequencing reveals loci under selection during chicken domestication, Nature 464:587-591. PMID: 20220755

The second data set contains 91,083 chicken "SAPs" (Single Amino-acid Polymorphisms, including synonymous coding differences), each with 7 columns.

Third is a set of 15,368 chicken gene names, each with 5 columns.

We also provide a table of putative selective sweeps in domestic chickens, from Rubin et al. (2010), each with 12 columns.

Another example in the paper uses putative stickleback SNPs that we downloaded from the Stickleback Browser, retaining only the 1,870,135 that could be mapped to the fugu genome, each with 90 columns.

We also studied the 32,795 fugu-based stickleback SAPs, each with 7 columns.

Jones et al. (2012) report 81 genomic intervals identified by two complex methods for detecting signals of positive selection. Each interval has 12 columns.

Jones FC, Grabherr MG, Chan YF, Russell P, Mauceli E, Johnson J, Swofford R, Pirun M, Zody MC, White S, et al. 2012. The genomic basis of adaptive evolution in threespine sticklebacks. Nature 484: 55-61. 

For the LCT example we used a set of 8,598,051 genome-wide SNPs called from a synthesized set of low-coverage sequence data from 12 human individuals, each SNP given with 53 columns.

We predicted 328,366 human SNPs in the first 140 Mb of chromosome 2 using the assembly of rhesus chromosome 13 as the reference; each SNP has 56 columns.

Finally, we predicted 52,227 human SNPs in the same region using the dog genome assembly as reference; each SNP has 56 columns.
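Since each data set above is described by a row count and a fixed number of columns, a downloaded copy can be sanity-checked before analysis. The following is a minimal sketch, assuming the table is tab-delimited text and that lines starting with "#" are header or comment lines; both are assumptions about the file layout, and the function name `check_columns` and the sample rows are hypothetical.

```python
import io

def check_columns(handle, expected_columns):
    """Count data rows and report lines whose column count is unexpected.

    Assumes tab-delimited text; blank lines and lines starting with '#'
    are skipped (an assumption, not something the data sets guarantee).
    Returns (number_of_data_rows, list_of_offending_line_numbers).
    """
    rows = 0
    bad_lines = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        rows += 1
        if len(line.split("\t")) != expected_columns:
            bad_lines.append(line_number)
    return rows, bad_lines

# Tiny made-up table with 4 columns; the third line is missing a field.
sample = (
    "#chrom\tpos\tref\talt\n"
    "chr1\t100\tA\tG\n"
    "chr1\t200\tC\n"
    "chr2\t300\tT\tC\n"
)
rows, bad_lines = check_columns(io.StringIO(sample), expected_columns=4)
print(rows, bad_lines)
```

For the chicken SNP table, for example, one would pass an open file handle and `expected_columns=45`, and compare the returned row count with 7,285,024.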


The workflows contain commands for the main analyses reported in the body of the paper. The user is invited to modify the commands to compute more of the results described in the main paper and supplement. Many of the Galaxy tools used in these workflows can be found under "Genome Diversity" in the left panel on the Analyze Data page. A tutorial can be found under Example 4 on this page.

The first workflow searches for evidence of selective sweeps in all domestic lines, as well as just in commercial broilers. The workflow needs to be applied to the "chicken SNPs" and "chicken genes" data sets as follows:
(1) Under "Analyze Data" (in the black bar), create an empty history.
(2) Under "Shared Data" -> "Published Pages", view this page.
(3) Import the "chicken SNPs" data set ("+" in the green circle near the right of the green bar), then click on "return to the previous page".
(4) Similarly, import the "chicken genes" data set.
(5) Import the "pipeline chicken" workflow, and click on "start using this workflow".
(6) You will be taken to your Workflow page, which will have a workflow called "imported pipeline chicken"; click on it and select "run".
(7) You will be taken to a history that includes the "chicken SNPs" and "chicken genes" data sets and the "pipeline chicken" workflow; check that the inputs to the workflow are properly assigned, scroll to the bottom of the workflow (middle panel), and press "Run workflow".

The second workflow compares the selective sweeps (in freshwater compared to marine individuals) found by our method with the 81 regions reported by Jones et al. (2012). It uses two data sets: fugu-based stickleback SNPs and 81 sweeps from the stickleback paper.
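The comparison described above amounts to asking which published regions overlap at least one interval found by another method. The following is a minimal sketch of that overlap test, not the workflow's actual implementation. It assumes intervals are given as (chromosome, start, end) tuples with half-open coordinates; the function name `overlapping` and all coordinates are made up for illustration.

```python
def overlapping(ours, published):
    """Return the published intervals that overlap at least one of ours.

    Each interval is a (chromosome, start, end) tuple with half-open
    coordinates [start, end); this convention is an assumption.
    A simple all-pairs scan is adequate for a list of 81 regions.
    """
    hits = []
    for p_chrom, p_start, p_end in published:
        for o_chrom, o_start, o_end in ours:
            if p_chrom == o_chrom and p_start < o_end and o_start < p_end:
                hits.append((p_chrom, p_start, p_end))
                break  # one overlap is enough for this region
    return hits

# Made-up example intervals.
ours = [("chrI", 100, 200), ("chrII", 50, 80)]
published = [
    ("chrI", 150, 300),   # overlaps chrI:100-200
    ("chrI", 200, 250),   # only touches the end point, so no overlap
    ("chrII", 10, 40),    # same chromosome, disjoint
    ("chrIV", 100, 200),  # different chromosome
]
recovered = overlapping(ours, published)
print(len(recovered), "of", len(published), "published regions recovered")
```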

One of the workflows associated with the LCT example searches the whole human genome for selective sweeps using two notions of Fst. It uses one data set: "human SNPs, like aye-aye".
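This page does not say which two notions of Fst the workflow uses. As background only, the sketch below computes one textbook definition, Wright's Fst for a single biallelic SNP in two populations, as (Ht - Hs) / Ht, with the populations weighted equally. The function name `wright_fst` and the allele counts are hypothetical, and this is not claimed to match either estimator in the Galaxy tools.

```python
def wright_fst(counts_pop1, counts_pop2):
    """Wright's Fst for one biallelic SNP from allele counts.

    Each argument is a (reference_count, alternate_count) pair.
    Populations are weighted equally (a simplifying assumption).
    Returns None when either population has no data or when the
    site is monomorphic overall, where Fst is undefined.
    """
    ref1, alt1 = counts_pop1
    ref2, alt2 = counts_pop2
    if ref1 + alt1 == 0 or ref2 + alt2 == 0:
        return None
    p1 = ref1 / (ref1 + alt1)  # reference-allele frequency, population 1
    p2 = ref2 / (ref2 + alt2)  # reference-allele frequency, population 2
    # Mean expected heterozygosity within populations.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled population.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return None
    return (h_t - h_s) / h_t

print(wright_fst((10, 0), (0, 10)))  # alleles fixed differently
print(wright_fst((5, 5), (5, 5)))    # identical frequencies
print(wright_fst((8, 2), (2, 8)))    # intermediate differentiation
```

Values near 1 indicate strongly differentiated allele frequencies between the two groups, and values near 0 indicate similar frequencies.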

Another workflow provides the data given in Figure 5A. It uses two data sets: "human LCT SNPs (rhesus reference)" and "human SNPs, like aye-aye". Be warned that at one step it maps over 400,000 putative SNPs from human to rhesus coordinates, which takes around an hour to run.

The final workflow provides the data given in Figure 5B. It uses two data sets: "human LCT SNPs (dog reference)" and "human SNPs, like aye-aye". Like the previous workflow, it has a step that runs for around an hour.