Elephantid genomes reveal the molecular bases of Woolly Mammoth adaptations to the arctic
Vincent J. Lynch1*, Oscar C. Bedoya-Reina2,3, Aakrosh Ratan2,4, Michael Sulak1, Daniela I. Drautz-Moses2,5, George H. Perry6, Webb Miller2,*, Stephan C. Schuster2,5
1Department of Human Genetics, The University of Chicago, 920 E. 58th Street, CLSC 319C, Chicago, IL 60637, USA.
2Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, 506B Wartik Lab, University Park, PA 16802, USA.
3Current address: MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford, OX1 3PT, UK.
4Current address: Department of Public Health Sciences and Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, USA.
5Current address: Singapore Centre on Environmental Life Sciences Engineering, Nanyang Technological University, 60 Nanyang Drive, SBS-01N-27, Singapore 637551.
6Departments of Anthropology and Biology, Pennsylvania State University, 513 Carpenter Building, University Park, PA 16802, USA.
*Correspondence: email@example.com, firstname.lastname@example.org
We make available three Galaxy datasets associated with this paper. You can find them under "Mammoth, Cell Reports" in the Genome Diversity library. The first dataset holds the 33,155,215 putative SNPs, i.e., positions in loxAfr3, the genome assembly of the African savannah elephant, where we observed a variant allele in at least one of the 3 Asian elephants and/or 2 mammoths that we sequenced. This is a tab-separated file with the following columns:
1. scaf - scaffold name
2. pos - position
3. A - reference (loxAfr3) allele
4. B - variant allele
5. Aqual - Phred-scaled probability of the alternate allele in the Indian elephants
6. Mqual - Phred-scaled probability of the alternate allele in the mammoths

For each of the 5 individual Elephantidae samples there are four columns, giving the count of the first allele, the count of the second allele, the SAMtools genotype (number of copies of the first allele), and the quality of the called genotype. These values occupy columns 7-26:

7-10. individual 1 -- Asian1
11-14. individual 2 -- Asian2
15-18. individual 3 -- Asian3
19-22. individual 4 -- M4
23-26. individual 5 -- M25

We observed a considerable amount of DNA damage at the ends of the reads for samples M4 and M25. In some cases, this can lead to spurious variant calls. To compensate for the DNA damage, we replaced the putatively damaged bases with "N"s, used only reads longer than 20 bp, and recalculated the read counts supporting the two alleles. These recalculated read counts occupy columns 27-34:

27-30. individual 4 -- M4.1
31-34. individual 5 -- M25.1
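As an illustration of the column layout described above, the following minimal Python sketch splits one row of the SNP file into named fields. It is not part of the Galaxy workflow. The function name parse_snp_row and the class SampleCall are hypothetical, and the sketch assumes that every row has all 34 columns, that all per-sample values are integers, and that each block of four columns for M4.1 and M25.1 follows the same order as the other samples.

```python
from typing import NamedTuple

# Sample order follows the column description: columns 7-26, then 27-34.
SAMPLES = ["Asian1", "Asian2", "Asian3", "M4", "M25"]
DAMAGE_MASKED = ["M4.1", "M25.1"]


class SampleCall(NamedTuple):
    count_a: int    # reads supporting the first (reference) allele
    count_b: int    # reads supporting the second (variant) allele
    genotype: int   # SAMtools genotype: copies of the first allele
    quality: int    # quality of the called genotype


def parse_snp_row(line):
    """Split one tab-separated SNP row into a dict of named fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 34:
        raise ValueError("expected 34 columns, found %d" % len(f))
    row = {
        "scaf": f[0],
        "pos": int(f[1]),
        "A": f[2],
        "B": f[3],
        "Aqual": int(f[4]),
        "Mqual": int(f[5]),
    }
    # Each sample occupies four consecutive columns, starting at column 7
    # (0-based index 6). Integer values are an assumption of this sketch.
    for i, name in enumerate(SAMPLES + DAMAGE_MASKED):
        start = 6 + 4 * i
        row[name] = SampleCall(*(int(x) for x in f[start:start + 4]))
    return row
```

With a row parsed this way, row["M4"].genotype gives the genotype called for the first mammoth, and row["M4.1"] holds the four values recalculated after masking damaged bases.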
The second dataset that we make available
identifies the 170,274 SNPs lying in protein-coding regions, as annotated by Ensembl. This is a tab-separated file with the following columns:
1. ref - loxAfr3 scaffold
2. rPos - position on loxAfr3 scaffold
3. trns - ENSEMBL transcript name
4. gene - common name of gene (or "N" if the elephant gene model isn't named)
5. AA1 - one amino acid
6. loc - location in the peptide sequence
7. AA2 - variant amino acid
We also provide a command history that uses the two datasets to prepare a table with 2,064 putative woolly mammoth-specific amino acids, i.e., positions in the African elephant assembly where the three Asian elephants appear to be homozygous for the African reference nucleotide, but the two mammoths appear to be homozygous for the non-synonymous variant nucleotide. The table has the following columns:
1. gene name
2. elephant amino acid
3. peptide position
4. woolly mammoth amino acid
5. ENSEMBL transcript name
6. Loxodonta (loxAfr3) scaffold
7. position in scaffold
8. human (hg19) orthologous chromosome ("none" in 79 cases)
9. chromosomal position (-1 in 79 cases)
10. Phred-scaled probability of the alternate allele in the mammoths; maximum is 999
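The selection rule behind this table can be sketched in a few lines of Python. This is not the Galaxy command history itself, only an approximation under stated assumptions: the function names are hypothetical, rows are already split on tabs, a genotype of 2 means homozygous for the reference allele and 0 means homozygous for the variant allele, the mammoth genotypes are read from columns 21 and 25 (not from the damage-masked columns), and a change counts as non-synonymous when AA1 differs from AA2.

```python
# 0-based indices of the SAMtools genotype column for each sample in the
# SNP file (1-based columns 9, 13, 17 for Asian1-3 and 21, 25 for M4, M25).
ASIAN_GT_COLS = (8, 12, 16)
MAMMOTH_GT_COLS = (20, 24)


def mammoth_specific_sites(snp_rows):
    """Return the (scaffold, position) pairs that pass the genotype rule.

    snp_rows: iterable of lists of strings, one list per SNP row.
    """
    sites = set()
    for f in snp_rows:
        asian_ref = all(f[i] == "2" for i in ASIAN_GT_COLS)      # homozygous reference
        mammoth_var = all(f[i] == "0" for i in MAMMOTH_GT_COLS)  # homozygous variant
        if asian_ref and mammoth_var:
            sites.add((f[0], f[1]))
    return sites


def mammoth_specific_changes(snp_rows, coding_rows):
    """Keep coding SNPs at mammoth-specific sites that change the amino acid.

    coding_rows: iterable of lists with the columns
    ref, rPos, trns, gene, AA1, loc, AA2.
    """
    sites = mammoth_specific_sites(snp_rows)
    return [c for c in coding_rows
            if (c[0], c[1]) in sites and c[4] != c[6]]
```

Joining on the scaffold name and position as strings avoids any conversion step, but it assumes both files write positions in the same format.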
The third dataset in the folder "Mammoth, Cell Reports" in the Genome Diversity library adds three columns that we computed "off-line". (They cannot be computed in Galaxy.) One gives the BLOSUM80 substitution score for the amino acids in columns 2 and 4. (Negative values indicate that the amino acids are rarely exchanged during evolution.) Two additional columns give the PolyPhen-2 prediction ("benign", "possibly damaging", "probably damaging" or "unknown") and the PolyPhen-2 score. (In the course of applying PolyPhen-2, we lost 18 rows, so the table contains only the 2,046 amino acid differences, including the introduction of a stop codon, mentioned in the paper.)
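A quick way to get an overview of the third dataset is to tally the PolyPhen-2 predictions and count the substitutions with a negative BLOSUM80 score. The sketch below is only illustrative. It assumes that the three added columns follow the ten columns listed above in the order given here (column 11 for the BLOSUM80 score, column 12 for the PolyPhen-2 prediction), which should be checked against the file itself. The function name summarize_predictions is hypothetical.

```python
from collections import Counter


def summarize_predictions(rows, blosum_col=10, call_col=11):
    """Tally PolyPhen-2 predictions and count negative BLOSUM80 scores.

    rows: iterable of lists of strings, one list per table row.
    blosum_col, call_col: 0-based column indices (assumed, not confirmed).
    """
    calls = Counter()
    negative = 0
    for r in rows:
        calls[r[call_col]] += 1
        try:
            score = int(r[blosum_col])
        except ValueError:
            # Rows without an integer score (for example a stop codon)
            # are left out of the BLOSUM80 count.
            continue
        if score < 0:
            negative += 1
    return calls, negative
```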