RNA-seq Analysis Exercise

Galaxy provides the tools necessary to creating and executing a complete RNA-seq analysis pipeline. This exercise introduces these tools and guides you through a simple pipeline using some example datasets. Familiarity with Galaxy and the general concepts of RNA-seq analysis are useful for understanding this exercise. This exercise should take 1-2 hours. You can check your work by looking at the history and visualization at the bottom of this page, which contain the datasets for the completed exercise.

Input Datasets

Below are small samples of datasets from the Illumina BodyMap 2.0 project; specifically, the datasets are paired-end 50bp reads from adrenal and brain tissues. The sampled reads map mostly to a 500Kb region of chromosome 19, positions 3-3.5 million (chr19:3000000:3500000).

RNA-seq data from adrenal tissue:

Galaxy Dataset | adrenal_1.fastq

Forward RNA-seq reads from BodyMap 2.0 project, adrenal tissue, mapping to chr19:3000000:3500000


Galaxy Dataset | adrenal_2.fastq

Reverse RNA-seq reads from BodyMap 2.0 project, adrenal tissue, mapping to chr19:3000000:3500000

RNA-seq data from brain tissue:

Galaxy Dataset | brain_1.fastq

Forward RNA-seq reads from BodyMap 2.0 project, brain tissue, mapping to chr19:3000000:3500000


Galaxy Dataset | brain_2.fastq

Reverse RNA-seq reads from BodyMap 2.0 project, brain tissue, mapping to chr19:3000000:3500000

You'll also need one additional dataset: a gene annotation in GTF format. Here is iGenomes gene annotation for the UCSC hg19 build:

Galaxy Dataset | iGenomes UCSC hg19, chr19 gene annotation

iGenomes gene annotation for chr19 of hg19 build

Here's a history containing all five datasets; click on the green plus to import (copy) it into your workspace. Use this history to complete the exercise.

Galaxy History | RNA-seq exercise datasets

Datasets needed for Galaxy RNA-seq exercise

Understanding and QCing the reads

You should understand the reads a bit before analyzing them. Quality control (QC) may be needed as well.

Step 1: Run a quality control check on your data using the [NGS: QC and manipulation >FastQC tool. Often, it is useful to trim reads to remove base positions that have a low median (or bottom quartile) score. For this exercise, assume a median quality score of below 20 to be unusable. Given this criterion, is trimming needed for the datasets? If so, which base pairs should be trimmed? The datasets do not pass FastQC's quality control checks for per-sequence GC content and sequence duplication levels. This is expected because FastQC is designed for DNA sequencing data, not RNA sequencing. What unique characteristics of RNA-seq data will cause these two QC checks to fail?

Step 2: If necessary, trim the reads based on your answers to step A using [NGS: QC and manipulation >FASTQ Quality Trimmer

Map processed reads

The next step is mapping the processed reads to the genome. The major challenge when mapping RNA-seq reads is that the reads, because they come from RNA, often cross splice junction boundaries; splice junctions are not present in a genome's sequence, and hence typical NGS mappers such as Bowtie and BWA are not ideal without modifying the genome sequence. Instead, it is better to use a mapper such as Tophat that is designed to map RNA-seq reads.

Step 1: Use the [NGS: RNA Analysis >Tophat tool to map RNA-seq reads to the hg19 build. Because the reads are paired, you'll need to set mean inner distance between pairs; this is the average distance in basepairs between reads, not the total insert/fragment size. Use a mean inner distance of 110 for BodyMap data.

Look at the documentation to understand the datasets that Tophat produces. What percentage of left reads were aligned (mapped)? What percentage of read pairs aligned uniquely (as opposed to those that have multiple alignments)? How many splice junctions did Tophat find? Are most splice junctions supported by (i.e. spanned by) more or less than 10 reads? (Hint: the score column in the splice junctions dataset is useful for answering this question.)

Step 2: To view Tophat's output, create a simple Galaxy visualization by selecting Visualization > New Track Browser from the main Galaxy menu at the top. Create the visualization using the hg19 build and add datasets to your visualization by clicking on the Add Datasets to Visualization button. Add (a) Tophat's accepted hits BAM datasets, which are the mapped reads; (b) Tophat's splice junctions datasets, which denote the junctions found by mapping the reads; and (c) the iGenomes gene annotation.

Navigate to chr19 using the select box at the top of visualization and look at the data. Zoom in to view the data at greater detail. You can zoom in by (a) double-clicking anywhere on the visualization to zoom in on that area or (b) dragging on the base number area at the top of the visualization to create a zoom area; or (c) clicking on the location bar and entering the region with data, chr19:3000000:3500000

You should be able to see: (a) the reads mapped by Tophat, including reads mapped across introns (there are a lot of reads!); (b) the splice junctions produced by Tophat; and (c) how Tophat's reads and junction correspond to the annotated genes. Find a gene locus where there are mapped reads, and then find an example of a splice junction between 2 known exons, and find an example where a splice junction might be found but is not.

Assemble and analyze transcripts

After mapping the reads, the next step is to assemble the reads into complete transcripts that can be analyzed for differential expression and phenomena such as splicing events and transcriptional start sites.

Step 1: Run [NGS: RNA Analysis >Cufflinks on each BAM dataset produced by Tophat using de novo assembly; this will assemble the reads into transcripts. Here is the documentation for the datasets that Cufflinks produces. Look at the assembled transcripts dataset and find some transcripts that have multiple exons.

Step 2: Add the Cufflinks' assembled transcripts datasets to the visualization you created earlier in order to view the transcripts alongside the mapped reads, junctions, and reference genes. Can you find examples where Cufflinks assembled a complete or almost complete transcript?

Step 3a: Run [NGS: RNA Analysis >Cuffmerge on two datasets of assembled transcripts from Step 1 and use the UCSC genes annotation as the reference annotation. Cuffmerge produces a merged transcripts dataset that includes all transcript in both datasets. The merged set of transcripts are needed in subsequent steps. You can find information about this dataset by reading the Cuffmerge documentation.

Step 3b: Run [NGS: RNA Analysis >Cuffdiff on (a) the merged transcripts produced by Cuffmerge and (b) Tophat's accepted hits datasets for each dataset to perform differential expression analyses on the two samples. Cuffdiff produces quite a few output datasets; you'll want to browse the Cuffdiff documentation to get a sense of what they do

Look at the transcript FPKM tracking dataset at the top of your history. Find an entry for a novel isoform--class code 'j'--and an entry for an isoform that matches a reference isoform--class code '='. What is the nearest gene and transcription start site for each entry and what is each transcript's FPKM value? 

Look at transcript differential expression testing dataset (second from the top of your history) and find two transcripts that are differentially expressed. What are the p and q values for each differentially expressed transcript? 

On your own

How many novel transcripts/isoforms were found in Cuffdiff? (Hint: use [Filter and Sort >] Filter on the transcript FPKM dataset.) Which genes are these novel transcripts in and are they all expressed in both tissues?

What genes are differentially expressed? (Hint: use [Filter and Sort >] Filter on the gene differential expression testing dataset.) Which genes are expressed at greater than 10FPKM in Adrenal tissue? (Hint: use [Filter and Sort >] Filter on the gene FPKM dataset.)

[This is a difficult question to answer.] Create a workflow from your history (in history's options menu, select 'Extract Workflow'), import the full Adrenal and Brain datasets from the BodyMap data library (go to the Shared Data tab, then Data Libraries, then search for 'bodymap'), and run the workflow on the full datasets. How many differentially expressed genes do you find?

More Resources

A history with the exercise completed (coming soon!):

And a visualization of mapped/aligned reads, assembled transcripts, and gene annotation (coming soon!):

Full RNA-seq BodyMap datasets are available in the data library 'Illumina BodyMap 2.0' and available for use. Use the Shared Data --> Data Libraries link in the header above to find this library.

Tophat documentation

Cufflinks/compare/merge/diff documentation

Nature Protocols paper describing RNA-seq analysis using Tophat-Cuff* pipeline

iGenomes data