Dynamics of mitochondrial heteroplasmy in three families investigated via a repeatable re-sequencing study

Hiroki Goto1, Benjamin Dickins2, Enis Afgan3,5, Ian M. Paul4, James Taylor3,5, Kateryna D. Makova1, and Anton Nekrutenko2,5 

Published in Genome Biology on June 23, 2011
Correspondence should be addressed to KDM, JT, or AN.

1. How to use this document

This document is a live copy of supplementary materials for the manuscript. It provides access to all the data as well as to exact analyses and workflows discussed in the paper, so you can play with them by re-running, changing parameters, or even applying them to your own sequencing data. To import workflows you must create a Galaxy account (unless you already have one) – a hassle-free procedure where you are only asked for a username and password. To make this even easier, we created several screencasts (very short movies) to help you:

In addition, we created two longer screencasts:

If you experience any problems while using this page, please e-mail our bug report list and we will get back to you.

2. Accessing the Data

All datasets discussed in the paper can be found in two places:

From there these datasets can either be downloaded or re-analyzed with Galaxy as described here. The name of each dataset is formatted as [family]-[tissue][individual]-[PCR replicate], where family is "F4", "F7", or "F11", tissue is either "c" (cheek swab of buccal tissue) or "b" (blood), individual is an individual id, and PCR replicate is either 1 or 2. For example, F4-bM4C2-1 means PCR replicate 1 from blood of individual M4C2 from family 4. The relationship among individuals is shown below (numbers in parentheses = age of each individual; number at the bottom of each table = count of sequencing reads):
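The naming convention above can be unpacked programmatically. The following is a minimal Python sketch, not part of the original analysis: the function name parse_dataset_name and the SampleName type are hypothetical, and the pattern assumes that individual ids contain only letters and digits, as in the F4-bM4C2-1 example.

```python
import re
from typing import NamedTuple


class SampleName(NamedTuple):
    family: str      # "F4", "F7", or "F11"
    tissue: str      # "cheek" or "blood"
    individual: str  # individual id, e.g. "M4C2"
    replicate: int   # PCR replicate, 1 or 2


# Assumption: individual ids are alphanumeric with no internal hyphen.
NAME_RE = re.compile(r"^(F(?:4|7|11))-([cb])([A-Za-z0-9]+)-([12])$")
TISSUES = {"c": "cheek", "b": "blood"}


def parse_dataset_name(name: str) -> SampleName:
    """Split a name of the form [family]-[tissue][individual]-[PCR replicate]."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name!r}")
    family, tissue, individual, replicate = match.groups()
    return SampleName(family, TISSUES[tissue], individual, int(replicate))


print(parse_dataset_name("F4-bM4C2-1"))
```

For the example name this yields family F4, tissue blood, individual M4C2, and replicate 1, matching the description in the text.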

Samples used in the study

3. Calling Heteroplasmies

3.1. Workflows

The manuscript describes the workflow that we used to identify heteroplasmies in our samples. In fact, there are two workflows, which are identical except for the number of inputs each takes. The first workflow takes two inputs and should be used for the analysis of samples in which two PCR replicates were performed (such as samples M5-G, M4, M4-C1, M10, M10-C2, M15, and M15-C2 in our study). The workflows can be viewed, imported, and edited directly from this page (see a Screencast explaining how to do this):

The second workflow takes only one input and should be used for processing of samples where PCR was not performed in replicates (such as M9 and M4-C3 in our study):

Both of these workflows output data in the following format:

chrM	8990	8991	C	0	1978	0	2	1980	0
chrM	8991	8992	C	0	1933	0	1	1934	0
chrM	8992	8993	C	0	1292	0	587	1879	2
chrM	8993	8994	T	0	2	0	1678	1680	0
chrM	8994	8995	G	2	0	1762	3	1767	0

where columns are defined as follows:

  1. Chromosome
  2. Start (0-based)
  3. End (1-based)
  4. Reference base
  5. Number of reads containing "A"
  6. Number of reads containing "C"
  7. Number of reads containing "G"
  8. Number of reads containing "T"
  9. Coverage (total number of reads overlaying this position)
  10. Number of variant sites (differences from the reference) with frequency above 1% (0.01)

For example, one can clearly see a heteroplasmy at position 8992 with high C (1292) and T (587) counts. 
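To make the column definitions concrete, here is a small Python sketch that parses rows in this format and reports which bases exceed a frequency threshold. It is only an illustration: the helper names parse_row and alleles_above are hypothetical, and the simple count/coverage ratio used here is an assumption that does not reproduce the strand-specific logic of the actual workflows.

```python
BASES = ("A", "C", "G", "T")


def parse_row(line):
    """Parse one 10-column output row into a dictionary."""
    fields = line.split()
    return {
        "chrom": fields[0],
        "start": int(fields[1]),   # 0-based
        "end": int(fields[2]),     # 1-based
        "ref": fields[3],
        "counts": dict(zip(BASES, (int(x) for x in fields[4:8]))),
        "coverage": int(fields[8]),
        "variants": int(fields[9]),
    }


def alleles_above(row, threshold=0.01):
    """Return {base: frequency} for bases whose read fraction exceeds threshold."""
    coverage = row["coverage"]
    if coverage == 0:
        return {}
    return {
        base: count / coverage
        for base, count in row["counts"].items()
        if count / coverage > threshold
    }


# The five example rows shown above.
rows = [
    "chrM 8990 8991 C 0 1978 0 2 1980 0",
    "chrM 8991 8992 C 0 1933 0 1 1934 0",
    "chrM 8992 8993 C 0 1292 0 587 1879 2",
    "chrM 8993 8994 T 0 2 0 1678 1680 0",
    "chrM 8994 8995 G 2 0 1762 3 1767 0",
]

for line in rows:
    row = parse_row(line)
    frequent = alleles_above(row)
    if len(frequent) > 1:
        # Only the row starting at 8992 has two bases (C and T) above 1%.
        print(row["start"], {b: round(f, 3) for b, f in frequent.items()})
```

In the example rows, only the site starting at 8992 has more than one base above the 1% threshold, which is the heteroplasmy pointed out in the text.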

3.2. Running workflows on Illumina datasets

In total we analyzed 32 datasets by running workflows 18 times (for each individual, workflows were run separately on blood and cheek samples):

  • the workflow 'mt analysis 0.01 strand-specific (fastq double)' was run 14 times on all datasets that contained PCR replicates: M5G, M4, M4C1, M10, M10C2, M15, and M15C2;
  • the workflow 'mt analysis 0.01 strand-specific (fastq single)' was run four times on datasets that lacked PCR replicates: M9 and M4C3.

For this we created three separate histories, one for each family. Each history (F4 = Family 4, F7 = Family 7, F11 = Family 11) can be examined in detail and imported below (see a Screencast explaining how to do this):

Each of the histories contains the original Illumina datasets and the outputs of the workflows.
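The run and dataset counts quoted at the start of this section follow from the sample lists. As a quick check, here is the arithmetic written out in Python (the variable names are illustrative only):

```python
# Individuals with and without PCR replicates, as listed in this section.
replicated = ["M5G", "M4", "M4C1", "M10", "M10C2", "M15", "M15C2"]
unreplicated = ["M9", "M4C3"]
tissues = ["blood", "cheek"]

# Each workflow run covers one individual and one tissue.
double_runs = len(replicated) * len(tissues)    # 7 * 2 = 14
single_runs = len(unreplicated) * len(tissues)  # 2 * 2 = 4
total_runs = double_runs + single_runs          # 18

# A "double" run consumes two datasets, a "single" run consumes one.
total_datasets = double_runs * 2 + single_runs  # 28 + 4 = 32

print(total_runs, total_datasets)
```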

3.3. Generating initial summary datasets

In the previous step we identified variable sites in all samples. Now we need to merge the results by generating reports for each family. To do this we first copied the results of the workflow executions into a new history called "F4-F7-F11 final report" (for an explanation of how to copy datasets between histories see this Screencast):

Within this history individual datasets are merged into summaries generated for each family. To be more specific, datasets 1 through 10 were merged into dataset 19 called "F4 summary", datasets 11 - 14 were joined into history item 22 called "F7 summary", and, finally, datasets 15 - 18 were used to generate #24 called "F11 summary". Merging of datasets was performed with the "Join, Subtract, and Group -> Column Join" tool. Let's look at dataset "F7 summary" to understand what this means:

Galaxy Dataset | F7 summary

Results of the heteroplasmy workflow for all individuals of family 7 joined together. You can click on the "rerun" button above to see the parameters.

The first four columns are (1) chromosome, (2) start, (3) end, and (4) reference base. The remaining 24 columns are in fact four sets of six columns (6 x 4 = 24; each set contains [1] count of As, [2] count of Cs, [3] count of Gs, [4] count of Ts, [5] coverage, and [6] the number of variants) representing blood of M10, cheek of M10, blood of M10C2, and finally cheek of M10C2. Dataset "F11 summary" has the same layout (four initial columns plus 24 data columns, because Family 11 also contains two studied individuals: M15 and M15C2), while dataset "F4 summary" contains 64 columns as this family has five individuals (5 individuals x 2 tissues x 6 columns = 60 columns plus four initial columns containing chromosome, start, end, and reference base data).
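The column layout described above can be written down as a small calculation. The sketch below is illustrative only: the function summary_columns and the field labels are hypothetical, and the sample order (blood before cheek for each individual) is taken from the description of "F7 summary".

```python
FIXED = ["chrom", "start", "end", "ref"]
PER_SAMPLE = ["A", "C", "G", "T", "coverage", "variants"]


def summary_columns(samples):
    """Map each sample label to {field: 1-based column number}."""
    layout = {}
    for i, sample in enumerate(samples):
        first = len(FIXED) + i * len(PER_SAMPLE) + 1
        layout[sample] = {field: first + j for j, field in enumerate(PER_SAMPLE)}
    return layout


f7_samples = ["M10 blood", "M10 cheek", "M10C2 blood", "M10C2 cheek"]
f7_layout = summary_columns(f7_samples)

# Total width: 4 fixed columns + 4 sets of 6 columns.
f7_width = len(FIXED) + len(f7_samples) * len(PER_SAMPLE)
# For Family 4: 5 individuals x 2 tissues x 6 columns + 4 fixed columns.
f4_width = len(FIXED) + 5 * 2 * len(PER_SAMPLE)

print(f7_width, f4_width)
print(f7_layout["M10C2 cheek"]["variants"])
```

Under these assumptions the variant-count column of the first set is column 10 and that of the last F7 set is column 28, with 64 columns in total for Family 4.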

3.4. Where are the heteroplasmies?

Datasets "F4 summary", "F7 summary", and "F11 summary" described in the previous section contain a mix of invariable positions and sites containing true variants. This is why each of these datasets contains 15,000+ rows. So before we can make sense of these, we need to filter out all sites that are invariable in all individuals and tissues. This can be done by filtering on the variant count column of each tissue/individual. In each set of six columns the variant column is the last. Such filtering is done with the "Filter and Sort -> Filter" tool. This is how datasets #22 (F4 variable sites), #23 (F7 variable sites), and #24 (F11 variable sites) were generated (to see the exact settings of the Filter tool you can import this history into your workspace and click on the rerun button adjacent to any of these datasets). These datasets were visualized in Galaxy (see datasets #25, #26, and #27 in history 'F4-F7-F11 final report') and were downloaded for further processing. At this point the data is 'distilled' down to variable sites, so it can be easily explored using a standard spreadsheet application. You can access annotated spreadsheet versions of these datasets below:
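The filtering idea can be sketched outside Galaxy as well. The Python below is a rough illustration, not the Filter tool itself: the function names are hypothetical, and the condition used (keep a row if any variant-count column is greater than zero) is an assumption; the exact Filter settings should be checked by rerunning the datasets in the history.

```python
def variant_column_indices(n_samples, fixed=4, per_sample=6):
    """0-based indices of the variant-count column (last of each set of six)."""
    return [fixed + (i + 1) * per_sample - 1 for i in range(n_samples)]


def is_variable(fields, n_samples):
    """True if at least one sample reports a nonzero variant count."""
    return any(int(fields[i]) > 0 for i in variant_column_indices(n_samples))


def filter_variable_sites(lines, n_samples):
    """Yield only the rows that are variable in at least one sample."""
    for line in lines:
        if is_variable(line.split(), n_samples):
            yield line


# Demonstration on the single-sample example rows from section 3.1
# (one set of six columns, so n_samples=1).
example_rows = [
    "chrM 8990 8991 C 0 1978 0 2 1980 0",
    "chrM 8991 8992 C 0 1933 0 1 1934 0",
    "chrM 8992 8993 C 0 1292 0 587 1879 2",
    "chrM 8993 8994 T 0 2 0 1678 1680 0",
    "chrM 8994 8995 G 2 0 1762 3 1767 0",
]

for kept in filter_variable_sites(example_rows, n_samples=1):
    print(kept)
```

For a family summary, n_samples would be the number of individual/tissue sets (4 for "F7 summary" and "F11 summary", 10 for "F4 summary").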

4. Running the same analysis on the cloud

To perform the same analysis on the cloud one would need to do two things (both of these steps are explained in this screencast):

4.1. Start your own instance on the Amazon Cloud

To start up your own personal Galaxy on the Amazon Cloud (it looks and feels exactly like Galaxy at http://usegalaxy.org), follow the steps outlined at http://usegalaxy.org/cloud. When initializing an instance, allocate 100 GB of disk space at startup and add 10 working cluster nodes (this screencast explains how).

4.2. Importing data and workflows into a Cloud Instance

Once your instance is up and running you need to do the following (to see how, advance this screencast to time point 7:56):

  1. Create a user account within the newly baked Galaxy;
  2. Add ten working nodes to your Cloud cluster;
  3. Import data and workflows;
  4. Start your analyses.

Affiliations

  1. Department of Biology, Penn State University, University Park, PA, USA
  2. The Huck Institutes for the Life Sciences and Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA, USA
  3. Department of Biology, Emory University, Atlanta, GA, USA
  4. College of Medicine, Penn State University, Hershey, PA, USA
  5. http://usegalaxy.org

Galaxy Team = Enis Afgan, Guru Ananda, Dan Blankenberg, Ramkrishna Chakrabarty, Nate Coraor, Jeremy Goecks, Jennifer Jackson, Greg Von Kuster, Ross Lazarus, Kanwei Li, Sergei Kosakovsky Pond, Anton Nekrutenko, James Taylor, Kelly Vincent