This document is a live copy of supplementary materials for the manuscript. It provides access to all the data as well as to exact analyses and workflows discussed in the paper, so you can play with them by re-running, changing parameters, or even applying them to your own sequencing data. To import workflows you must create a Galaxy account (unless you already have one) – a hassle-free procedure where you are only asked for a username and password. To make this even easier, we created several screencasts (very short movies) to help you:
In addition, we created two longer screencasts:
If you experience any problems while using this page, please e-mail our bug report list and we will get back to you.
All datasets discussed in the paper can be found in two places:
From there these datasets can either be downloaded or re-analyzed with Galaxy as described here. The name of each dataset is formatted as [family]-[tissue][individual]-[PCR replicate], where family is "F4", "F7", or "F11", tissue is either "c" (cheek swab of buccal tissue) or "b" (blood), individual is an individual id, and PCR replicate is either 1 or 2. For example, F4-bM4C2-1 means PCR replicate 1 from blood of individual M4C2 from family 4. The relationship among individuals is shown below (numbers in parentheses = age of each individual; number at the bottom of each table = count of sequencing reads):
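The naming convention described above can be split into its parts programmatically. The following is a minimal sketch, not part of the original workflows; the function and class names are hypothetical, and it assumes that individual ids are purely alphanumeric (as in "M4C2").

```python
import re
from typing import NamedTuple

# Pattern for [family]-[tissue][individual]-[PCR replicate].
# Assumption: individual ids are alphanumeric, e.g. "M4C2".
NAME_RE = re.compile(r"^(F4|F7|F11)-([bc])([A-Za-z0-9]+)-([12])$")

# Tissue codes as defined in the text.
TISSUES = {"b": "blood", "c": "cheek"}


class DatasetName(NamedTuple):
    family: str
    tissue: str
    individual: str
    replicate: int


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name into family, tissue, individual, and replicate."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name!r}")
    family, tissue, individual, replicate = match.groups()
    return DatasetName(family, TISSUES[tissue], individual, int(replicate))


if __name__ == "__main__":
    print(parse_dataset_name("F4-bM4C2-1"))
```

Names that do not follow the convention raise a `ValueError` rather than being silently mis-parsed.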
The manuscript describes a workflow that we have used to identify heteroplasmies in our samples. In fact there are two workflows. They are identical with the exception of the number of inputs each takes. The first workflow takes two inputs and should be used for the analysis of samples in which two PCR replicates were performed (such as samples M5-G, M4, M4-C1, M10, M10-C2, M15, and M15-C2 in our study). The workflows can be viewed, imported, and edited directly from this page (see a Screencast explaining how to do this):
The second workflow takes only one input and should be used for processing of samples where PCR was not performed in replicates (such as M9 and M4-C3 in our study):
Both of these workflows output data in the following format:
chrM 8990 8991 C 0 1978 0 2 1980 0
chrM 8991 8992 C 0 1933 0 1 1934 0
chrM 8992 8993 C 0 1292 0 587 1879 2
chrM 8993 8994 T 0 2 0 1678 1680 0
chrM 8994 8995 G 2 0 1762 3 1767 0
where columns are defined as follows:
For example, one can clearly see a heteroplasmy at position 8992 with high C (1292) and T (587) counts.
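The example above can be reproduced outside Galaxy with a few lines of Python. This is only an illustrative sketch: it assumes the ten columns are chromosome, start, end, reference base, counts of A, C, G, and T, coverage, and number of variants (the layout described for the summary datasets further down this page), and the names `Site`, `parse_site`, and `allele_fractions` are hypothetical.

```python
from typing import Dict, List, NamedTuple

# The five example rows shown above.
ROWS = """\
chrM 8990 8991 C 0 1978 0 2 1980 0
chrM 8991 8992 C 0 1933 0 1 1934 0
chrM 8992 8993 C 0 1292 0 587 1879 2
chrM 8993 8994 T 0 2 0 1678 1680 0
chrM 8994 8995 G 2 0 1762 3 1767 0
"""


class Site(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    counts: Dict[str, int]  # per-base read counts
    coverage: int
    variants: int


def parse_site(line: str) -> Site:
    """Parse one whitespace-separated output row (assumed column order)."""
    f = line.split()
    a, c, g, t, coverage, variants = map(int, f[4:10])
    counts = {"A": a, "C": c, "G": g, "T": t}
    return Site(f[0], int(f[1]), int(f[2]), f[3], counts, coverage, variants)


def allele_fractions(site: Site) -> Dict[str, float]:
    """Fraction of reads supporting each base at a site."""
    if site.coverage == 0:
        return {}
    return {base: n / site.coverage for base, n in site.counts.items()}


sites: List[Site] = [parse_site(line) for line in ROWS.splitlines() if line.strip()]

# Keep only sites where the workflow reported at least one variant.
variable = [s for s in sites if s.variants > 0]

if __name__ == "__main__":
    for s in variable:
        fractions = {b: round(x, 3) for b, x in allele_fractions(s).items()}
        print(s.chrom, s.start, s.ref, fractions)
```

Only the row starting at 8992 has a non-zero variant count, which matches the heteroplasmy pointed out in the text.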
In total we analyzed 32 datasets by running workflows 18 times (for each individual workflows were run separately on blood and cheek samples):
For this we created three separate histories, one for each family. Each history (F4 = Family 4, F7 = Family 7, F11 = Family 11) can be examined in detail and imported below (see a Screencast explaining how to do this):
Each of the histories contains the original Illumina datasets and the outputs of the workflows.
In the previous step we identified variable sites in all samples. Now we need to merge the results by generating reports for each family. To do this we first copied the results of the workflow executions into a new history called "F4-F7-F11 final report" (for an explanation of how to copy datasets between histories see this Screencast):
Within this history individual datasets are merged into summaries generated for each family. To be more specific, datasets 1 through 10 were merged into dataset 19 called "F4 summary", datasets 11 - 14 were joined into history item 22 called "F7 summary", and, finally, datasets 15 - 18 were used to generate #24 called "F11 summary". Merging of datasets was performed with "Join, Subtract, and Group -> Column Join" tool. Let's look at dataset "F7 summary" to understand what this means:
The first four columns are (1) chromosome, (2) start, (3) end, and (4) reference base. The remaining 24 columns are in fact four sets of six columns (6 X 4 = 24; each set contains the count of As, count of Cs, count of Gs, count of Ts, coverage, and the number of variants) representing blood of M10, cheek of M10, blood of M10C2, and finally cheek of M10C2. Dataset "F11 summary" also contains 24 such columns after the four initial ones (because Family 11 also contains two studied individuals: M15 and M15C2), while dataset "F4 summary" contains 64 columns in total as this family has five individuals (5 individuals X 2 tissues X 6 columns = 60 columns plus four initial columns containing chromosome, start, end, and reference base data).
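The column arithmetic above generalizes to any number of individuals. The sketch below (hypothetical helper names, not part of the Galaxy tools) computes the total column count of a summary dataset and the 1-based positions of the variant-count column in each six-column set, assuming two tissues per individual as in this study.

```python
from typing import List

LEADING_COLUMNS = 4   # chromosome, start, end, reference base
COLUMNS_PER_SET = 6   # A, C, G, T counts, coverage, number of variants


def total_columns(n_individuals: int, n_tissues: int = 2) -> int:
    """Total number of columns in a family summary dataset."""
    return LEADING_COLUMNS + n_individuals * n_tissues * COLUMNS_PER_SET


def variant_columns(n_individuals: int, n_tissues: int = 2) -> List[int]:
    """1-based column numbers holding the variant count of each set."""
    n_sets = n_individuals * n_tissues
    return [LEADING_COLUMNS + COLUMNS_PER_SET * (i + 1) for i in range(n_sets)]


if __name__ == "__main__":
    print(total_columns(2), variant_columns(2))  # two individuals (F7, F11)
    print(total_columns(5))                      # five individuals (F4)
```

For a family with two individuals this gives 28 columns in total (4 + 24), and for Family 4 it gives the 64 columns mentioned above.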
Datasets "F4 summary", "F7 summary", and "F11 summary" described in the previous section contain a mix of invariable positions and sites containing true variants. This is why each of these datasets contains 15,000+ rows. So before we can make sense of these, we need to filter out all sites that are invariable in all individuals and tissues. This can be done by filtering on the variant count column of each tissue/individual. In each set of six columns the variant column is the last one. Such filtering is done with the "Filter and Sort -> Filter" tool. This is how datasets #22 (F4 variable sites), #23 (F7 variable sites), and #24 (F11 variable sites) were generated (to see the exact settings of the Filter tool you can import this history into your workspace and click on the rerun button adjacent to any of these datasets). These datasets were visualized in Galaxy (see datasets #25, #26, and #27 in history 'F4-F7-F11 final report') and were downloaded for further processing. At this point the data is so 'distilled' down to variable sites that it can be easily explored using a standard spreadsheet application. You can access annotated spreadsheet versions of these datasets below:
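The filtering step described in this section can also be sketched outside Galaxy. The code below is an assumption-laden illustration rather than the exact Filter tool settings (those are recorded in the history): it keeps a row if the last column of any six-column set is greater than zero. The helper names are hypothetical and the two rows are made-up toy data with two sets each, not real results.

```python
from typing import Iterable, List


def variant_column_indexes(n_columns: int, leading: int = 4, per_set: int = 6) -> List[int]:
    """0-based indexes of the last (variant count) column of each set."""
    return list(range(leading + per_set - 1, n_columns, per_set))


def keep_variable_sites(rows: Iterable[str]) -> List[str]:
    """Keep tab-separated rows that are variable in at least one set."""
    kept = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        indexes = variant_column_indexes(len(fields))
        if any(int(fields[i]) > 0 for i in indexes):
            kept.append(row)
    return kept


# Made-up toy rows (4 leading columns + 2 sets of 6), for illustration only.
TOY_ROWS = [
    "\t".join(["chrM", "100", "101", "A",
               "50", "0", "0", "0", "50", "0",
               "60", "0", "0", "0", "60", "0"]),
    "\t".join(["chrM", "200", "201", "C",
               "0", "40", "0", "10", "50", "2",
               "0", "60", "0", "0", "60", "0"]),
]

if __name__ == "__main__":
    for row in keep_variable_sites(TOY_ROWS):
        print(row)
```

Because the variant columns are located from the row width, the same function would apply to a summary with any number of six-column sets.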
To perform the same analysis on the cloud one would need to do two things (both of these steps are explained in this screencast):
To start up your own personal Galaxy on the Amazon Cloud (it looks and feels exactly like Galaxy at http://usegalaxy.org) follow the steps outlined at http://usegalaxy.org/cloud. When initializing an instance, allocate 100 GB of disk space at startup and add 10 working cluster nodes (this screencast explains how).
Once your instance is up and running you need to do the following (to see how, advance this screencast to time point 7:56):