Elephantid genomes reveal the molecular bases of Woolly Mammoth adaptations to the arctic

Vincent J. Lynch1*, Oscar C. Bedoya-Reina2,3, Aakrosh Ratan2,4, Michael Sulak1, Daniela I. Drautz-Moses2,5, George H. Perry6, Webb Miller2,*, Stephan C. Schuster2,5

 

1Department of Human Genetics, The University of Chicago, 920 E. 58th Street, CLSC 319C, Chicago, IL 60637, USA.

2Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, 506B Wartik Lab, University Park, PA 16802, USA.

3Current address: MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford, OX1 3PT, UK.

4Current address: Department of Public Health Sciences and Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, USA.

5Current address: Singapore Centre on Environmental Life Sciences Engineering, Nanyang Technological University, 60 Nanyang Drive, SBS-01N-27, Singapore 637551.

6Departments of Anthropology and Biology, Pennsylvania State University, 513 Carpenter Building, University Park, PA 16802, USA.

 

*Correspondence: vjlynch@uchicago.edu, webb@bx.psu.edu

We make available three Galaxy datasets associated with this paper. You can find them under "Mammoth, Cell Reports" in the Genome Diversity library. The first dataset

Dataset 'mammoth SNPs'

holds the 33,155,215 putative SNPs, i.e., positions in loxAfr3, the genome assembly of the African savannah elephant, where we observed a variant allele in at least one of the 3 Asian elephants and/or 2 mammoths that we sequenced. This is a tab-separated file with the following columns:

    1. scaf - scaffold name
    2. pos  - position
    3. A    - reference (loxAfr3) allele
    4. B    - variant allele
    5. Aqual - Phred-scaled probability of the alternate allele in the Asian elephants
    6. Mqual - Phred-scaled probability of the alternate allele in the mammoths

For each of the 5 individual elephantid samples there are four columns,
giving the count of the first allele, the count of the second allele, the
SAMtools genotype (number of copies of the first allele), and the quality
of the called genotype.  These values occupy columns 7-26.

 7-10. individual 1 -- Asian1
11-14. individual 2 -- Asian2
15-18. individual 3 -- Asian3
19-22. individual 4 -- M4
23-26. individual 5 -- M25

A considerable amount of DNA damage is observed at the ends of the reads
for samples M4 and M25.  In some cases, this can lead to spurious variant
calls.  To compensate for the DNA damage, we replace the putatively
damaged bases with "N"s, use only reads longer than 20 bp, and
recalculate the read counts supporting the two alleles.  These
recalculated read counts occupy columns 27-34.

27-30. individual 4 -- M4.1
31-34. individual 5 -- M25.1
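
The column layout described above can be sketched as a small parser. This is only an illustration under stated assumptions, not the code used for the paper: the function name `parse_snp_line` and the `SampleCall` type are hypothetical, the file is assumed to have no header line, and the per-sample fields in columns 7-26 are assumed to be numeric. Columns 27-34 are kept as raw strings.

```python
from typing import Dict, List, NamedTuple

# Sample order as listed for columns 7-26 and 27-34.
SAMPLES = ("Asian1", "Asian2", "Asian3", "M4", "M25")
DAMAGE_CORRECTED = ("M4.1", "M25.1")


class SampleCall(NamedTuple):
    """Hypothetical container for the four per-sample columns."""
    count_a: int     # reads supporting the first (reference) allele
    count_b: int     # reads supporting the second (variant) allele
    genotype: int    # SAMtools genotype: copies of the first allele
    quality: float   # quality of the called genotype


def parse_snp_line(line: str) -> dict:
    """Split one tab-separated row of 'mammoth SNPs' into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 34:
        raise ValueError(f"expected 34 columns, got {len(fields)}")

    samples: Dict[str, SampleCall] = {}
    for i, name in enumerate(SAMPLES):
        # Columns 7-10, 11-14, ... (1-based) are slices 6:10, 10:14, ...
        a, b, gt, q = fields[6 + 4 * i: 10 + 4 * i]
        samples[name] = SampleCall(int(a), int(b), int(gt), float(q))

    corrected: Dict[str, List[str]] = {}
    for i, name in enumerate(DAMAGE_CORRECTED):
        # Columns 27-30 and 31-34 (1-based), left unparsed.
        corrected[name] = fields[26 + 4 * i: 30 + 4 * i]

    return {
        "scaf": fields[0],
        "pos": int(fields[1]),
        "A": fields[2],
        "B": fields[3],
        "Aqual": float(fields[4]),
        "Mqual": float(fields[5]),
        "samples": samples,
        "damage_corrected": corrected,
    }
```

A record parsed this way exposes, for example, `record["samples"]["M4"].genotype` for the genotype call of mammoth M4.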

The second dataset that we make available

Dataset 'mammoth coding SNPs'

identifies the 170,274 SNPs lying in protein-coding regions, as annotated by Ensembl. This is a tab-separated file with the following columns:

  1. ref  - loxAfr3 scaffold
  2. rPos - position on loxAfr3 scaffold
  3. trns - ENSEMBL transcript name
  4. gene - common name of gene (or "N" if the elephant gene model isn't named)
  5. AA1  - one amino acid
  6. loc  - location in the peptide sequence
  7. AA2  - variant amino acid
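
Because both datasets identify a SNP by its loxAfr3 scaffold and position, rows of the coding-SNP file can be looked up by that pair. The sketch below is one way to build such an index; it is an assumption about how one might work with the file outside Galaxy, not a description of the Galaxy history. The function name `index_coding_snps` is hypothetical, and the file is assumed to have no header line.

```python
import csv
from typing import Dict, Iterable, List, Tuple

# Column names as listed for the 'mammoth coding SNPs' dataset.
CODING_COLUMNS = ("ref", "rPos", "trns", "gene", "AA1", "loc", "AA2")


def index_coding_snps(
    lines: Iterable[str],
) -> Dict[Tuple[str, int], List[Dict[str, str]]]:
    """Group coding-SNP rows by (scaffold, position).

    A list is kept per key so that several rows sharing one position
    (for instance, one per transcript) are all retained.
    """
    index: Dict[Tuple[str, int], List[Dict[str, str]]] = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(CODING_COLUMNS, row))
        key = (record["ref"], int(record["rPos"]))
        index.setdefault(key, []).append(record)
    return index
```

Given an open file handle `fh` for the dataset, `index_coding_snps(fh)` returns a dictionary in which `index[(scaffold, position)]` lists the annotated amino acid changes at that site.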

We also provide a command-history

History 'mammoth in Cell Reports'

that uses the two datasets to prepare a table

Dataset 'sort by gene name to get the final table (minus "exchangeabilty" and PolyPhen2 estimations (computed off-line)'

with 2,064 putative woolly mammoth-specific amino acids, i.e., positions in the African elephant assembly where the three Asian elephants appear to be homozygous for the African reference nucleotide, but the two mammoths appear to be homozygous for the non-synonymous variant nucleotide. The table has the following columns:

   1. gene name
   2. elephant amino acid
   3. peptide position
   4. woolly mammoth amino acid
   5. ENSEMBL transcript name
   6. Loxodonta (loxAfr3) scaffold
   7. position in scaffold
   8. human (hg19) orthologous chromosome ("none" in 79 cases)
   9. chromosomal position (-1 in 79 cases)
  10. Phred-scaled probability of the alternate allele in the mammoths; maximum is 999
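
The selection criterion for this table (all three Asian elephants homozygous for the reference allele, both mammoths homozygous for the variant allele) can be expressed against the genotype columns of the 'mammoth SNPs' dataset. The following is a minimal sketch, assuming the SAMtools genotype is the third of each sample's four columns and takes the value 2 for homozygous reference and 0 for homozygous variant, as implied by "number of copies of the first allele". The function name is hypothetical, and the actual Galaxy history may apply further filters (for example on quality) that are not shown here.

```python
from typing import Sequence

# 1-based positions of the genotype column for each sample
# (third column of the four-column groups starting at column 7).
ASIAN_GENOTYPE_COLS = (9, 13, 17)    # Asian1, Asian2, Asian3
MAMMOTH_GENOTYPE_COLS = (21, 25)     # M4, M25

HOMOZYGOUS_REFERENCE = 2  # two copies of the first (reference) allele
HOMOZYGOUS_VARIANT = 0    # zero copies of the first allele


def is_mammoth_specific(fields: Sequence[str]) -> bool:
    """Return True if a split 'mammoth SNPs' row matches the criterion."""
    asians_reference = all(
        int(fields[col - 1]) == HOMOZYGOUS_REFERENCE
        for col in ASIAN_GENOTYPE_COLS
    )
    mammoths_variant = all(
        int(fields[col - 1]) == HOMOZYGOUS_VARIANT
        for col in MAMMOTH_GENOTYPE_COLS
    )
    return asians_reference and mammoths_variant
```

Applied to each row with `is_mammoth_specific(line.rstrip("\n").split("\t"))`, this keeps only the fixed differences between the two mammoths and the three Asian elephants; restricting to non-synonymous changes additionally requires the coding-SNP annotations.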

The third dataset in the folder "Mammoth, Cell Reports" in the Genome Diversity library adds three columns that we computed "off-line". (They cannot be computed in Galaxy.) One gives the BLOSUM80 substitution score for the amino acids in columns 2 and 4. (Negative values indicate that the amino acids are rarely exchanged during evolution.) Two additional columns give the PolyPhen2 estimation ("benign", "possibly damaging", "probably damaging" or "unknown") and the PolyPhen2 score. (In the course of applying PolyPhen2, we lost 18 rows, so the table contains only the 2,046 amino acid differences, including the introduction of a stop codon, mentioned in the paper.)