Published Pages | jeremy | Interactive RNA-seq with Trackster

Interactive RNA-seq analyses by visualization with Trackster

Jeremy Goecks1, The Galaxy Team2, Anton Nekrutenko3, and James Taylor1

Correspondence should be addressed to AN or JT.

An Interactive Supplement

This document is an interactive supplement for the paper describing Trackster, Galaxy’s visual analysis environment. This document describes how Trackster can be used for interactive, visual analysis of high-throughput transcriptome sequencing data. As an example, we have used Trackster to investigate the expression dynamics of the human XBP1 gene. XBP1 plays a prominent role in the mammalian unfolded protein response [1,2,3] yet the mechanisms of its expression are not well understood. Embedded Galaxy objects and links to Galaxy objects throughout the document enable readers to understand every aspect of the analysis and reproduce parts or all of the analysis. All Trackster and Galaxy functionality is available using any modern Web browser.

Real-time Visualization for Assembling and Quantifying Transcripts

To investigate XBP1 isoform expression, we analyzed RNA-seq data from the Illumina BodyMap 2.0 project, which includes data from 16 different human tissues (ENA:ERP000546). The data includes two RNA-seq datasets for each tissue, with each dataset 15-25GB in size and containing ~80M reads. We began by using the spliced-read mapper Tophat [4] (version 1.4.0) to map RNA-seq reads to the genome and the Cufflinks tool suite [5] (v 1.3.0) to assemble the mapped reads into isoforms and quantify isoform expression.

However, using default parameter settings for Tophat and Cufflinks did not work well. XBP1 has two coding isoforms, XBP1u (u for unspliced) and XBP1s (s for spliced); these isoforms are quite similar because XBP1u is non-canonically spliced to remove a small, 26bp intron to produce XBP1s. Tophat failed to identify reads spanning the small intron that differentiates XBP1u from XBP1s. Cufflinks, due to very similar isoforms and low expression levels of XBP1 isoforms, often did not assemble all transcripts indicated by the mapped reads. To improve results from Tophat and Cufflinks, we modified each tool's settings.

Tophat's settings were changed to detect small, known introns so that XBP1s could be assembled. A de novo approach to RNA-seq mapping does not yield the unique intron of XBP1s due to its small size—26bp—and the relatively large read lengths as well as the larger exon that it lies within. To find splice junctions indicative of this unique intron, Tophat was run using a gene annotation file that included XBP1 gene structure and parameters were set so that Tophat reported introns of size 20bp or larger.

Improving Cufflinks isoform assembly and quantification was less straightforward because Cufflinks output is highly parameter dependent, and it was not possible to known in advance what settings yielded the best assembly for a particular dataset or whether a single set of parameters would suffice for all assemblies. To generate and compare Cufflinks assemblies built using different parameters, we used Trackster.

Trackster provides a tool integration framework that enables Cufflinks and other compatible genomic tools to be run within the visualization environment. Running a tool in Trackster—changing parameter settings, executing the tool, and visualizing tool output—can typically be done in about a minute. To reduce tool execution, Trackster runs a tool only on the data in the visible genomic region. Trackster automatically visualizes tool output so that it easy to compare tool outputs created using different parameter values. By enabling tools to run quickly and repeatedly using different parameter values, Trackster makes it possible to quickly explore a tool's parameter space and identify the parameter values that yield the best result.

Using Trackster, we reran Cufflinks with different parameter values in order to find values that yielded the best assembly for XBP1 isoforms. Because our focus is XBP1, Cufflinks assemblies were run only for the region 29189394-29196770 on chromosome 22, which includes XBP1. 25-50k reads mapped to this region, and running Cufflinks on these reads took about a minute in Trackster. Running Cufflinks on a complete dataset required about 12 hours. Hence, we were able to significantly reduce computing time and overall analysis time by using Trackster rather than running Cufflinks on the complete dataset.

The visualization below shows Cufflinks assemblies for one RNA-seq dataset from Adrenal tissue; below the assemblies are the mapped reads and reference annotation. Multiple assemblies were created by changing settings for the parameters minimum isoform fraction and pre mRNA fraction. The top track is the initial assembly, and the next two tracks were generated using Cufflinks by lowering values the minimum isoform fraction and pre mRNA fraction. These assemblies are significantly better than the initial one, although lowering the pre mRNA fraction too far was problematic (see the fourth assembly.


We found that, for multiple datasets and different tissues, lowering both the minimum isoform fraction  produced good assemblies for XBP1 isoforms for most datasets and hence used a low value for all Cufflinks assemblies. 

Dynamically Filtering Assembly Artifacts

Assembly artifacts are common in Cufflinks’ assemblies, and filtering them out can improve the quantification results. To find suitable filtering criteria for removing assembly artifacts, we used Trackster’s dynamic filtering. Dynamic filters are created automatically by Trackster based on a dataset’s attributes. As a filter is changed, data not matching the filter is hidden in real time. Filtering can be done for a single dataset or across multiple datasets simultaneously. Filters can also be combined to encode attribute value in data display.

Cufflinks’ assembly artifacts often have very low expression, so filtering on isoform expression will remove many assembly artifacts. Assembled Cufflinks transcripts have numerous attributes, including Score—a measure of relative isoform abundance and FPKM—an absolute measure of isoform abundance. Using Trackster’s dynamic filters, we experimented with different filtering criteria for these attributes across assemblies for multiple tissues. First, to get a sense of each transcript’s Score and FPKM value, we assigned a transcript’s transparency to denote its Score and its height to denote its FPKM value. This made is easy to identify transcripts with low expression. Next, filters for Score and FPKM were used to remove as many artifacts as possible while retaining putative transcripts. We eventually determined that filtering with a Score less than 224 or FPKM less than 3 removed a majority of artifacts. Without Trackster’s dynamic filtering, we would have had to repeatedly choose filtering criteria, filter transcripts, and then visualize the filtered transcripts. This would have been a tedious process

The visualization below shows the assembled isoforms for one dataset for each of four different tissues, as well as dynamic filters that can be used to explore the data. In the visualization, filters are set to remove assembly artifacts, and Score is encoded as transcript height.

Putting it all Together: A Workflow for XBP1 Expression Analysis

Based on experimentation in Trackster for transcript assembly and artifact removal, we developed this workflow to assemble and quantify XBP1 isoform expression:

First, all reads are mapped genome-wide using Tophat. Next, transcripts are assembled genome-wide for each dataset using Cufflinks and parameter settings derived from analyses performed in Trackster. Cufflinks' assembled transcripts from both datasets are filtered using criteria developed using Trackster’s dynamic filters, and the filtered transcripts are combined into a single set of final transcripts using Cuffcompare. Finally, Cuffdiff is used to quantify isoform expression using mapped reads from all datasets, and the Cuffdiff output is filtered to obtain results for XBP1.

Using this workflow, we analyzed all datasets for 16 different human tissues: adipose, adrenal, brain, breast, colon, heart, kidney, liver, lung, lymph node, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells. Table 1 shows XBP1 isoform expression levels for all tissues for all known isoforms.

Table 1. XBP1 isoform expression values and 95% confidence intervals. All values in FPKM.
AdiposeAdrenalBrainBreastColonHeartKidneyLiverLungLymph NodeOvaryProstateSkeletal MuscleTestesThyroidWhite Blood Cells
XBP1u24.5 (CI: 22.0-27.0)84.7 (77.6-91.7)14.5 (11.9-17.0)69.4 (63.0-75.8)46.1 (41.0-51.1)39.9 (34.8-45.0)121.1 (111.2-131.0)287.1 (253.8-320.3)89.2 (82.2-96.3)325.7 (300.4-350.9)74.6 (66.9-82.4)29.9 (27.4-32.4)60.4 (55.0-65.8)22.6 (20.5-24.8)49.0 (44.9-53.1)97.0 (88.9-105.1)
XBP1s11.7 (10.1-13.3)15.5 (13.3-17.7)1.2 (0.8-1.7)12.7 (10.4-14.9)23.9 (20.8-27.0)14.1 (11.2-16.9)10.3 (8.7-11.8)25.3 (20.7-29.8)40.9 (37.2-44.7)58.5 (53.3-63.7)21.2 (18.0-24.4)Not enough data20.1 (17.6-22.5)12.3 (10.8-13.8)7.3 (5.9-8.8)4.7 (3.4-6.0)
Non-coding Isoform (uc003aec.2)6.9 (6.0-7.9)1.5 (1.0-2.1)

The known XBP1 coding isoforms, XBP1u and XBP1s, are ubiquitously expressed in all human tissues. However, both XBP1u and XBP1s are expressed at very low levels compared to the average isoform expression level across all tissues. XBP1 isoforms are expressed at the level of 1-350 FPKM, but the average FPKM across tissues is 1000-3000 FPKM. XBP1u is expressed significantly more highly than XBP1s in this set of tissues. The ubiquitous expression of XBP1s indicates continuous background activity of IRE1. An additional known non-coding isoform—UCSC id uc003aec.2—is present in some tissues, though expression levels are low enough that experimental verification is needed.

Visualizing Translation of XBP1 using a Custom Browser

There is biochemical evidence that translation of XBP1u is paused so that a cell can quickly produce the XBP1s, an active transcription factor in the unfolded protein response [6]. To identify the translational pause point of XBP1 in human, we analyzed ribosome profiling data from the endoplasmic reticulum (ER) and cytoplasm of HEK293 cells (ENA:SRP007963). This data consists of ribosome protected fragments (RPFs)—fragments from mRNA contained within ribosomes—sequenced using the SOLiD 4 system; for each sample, ~50 millions 35-base reads were obtained. We mapped each set of RPFs to the XBP1 transcriptome. Read positions were adjusted by 15bp downstream so that each read's start mapped to ribosomes' A site [7]. Here is the workflow used:

There are two findings from this analysis. First, XBP1u translation is localized to the ER. Only a handful of RPFs in the cytosol dataset mapped to XBP1, but the ER dataset yielded ~375 mapped reads, all in XBP1u's open reading frame. Second, based on mapped fragments from the ER, XBP1's pause point likely occurs at the final proline in the translated polypeptide at position 258 as was shown in mice [8].

These results can be visualized in a custom Trackster browser. The visualization below shows a custom visualization of the XBP1 transcriptome with annotation, mapped ER RFPs, and 5' read coverage that make the pause point clear:

Summary

Trackster is a visual analysis environment that is integrated into Galaxy. We have used Trackster’s interactive visualization capabilities to analyze high-throughput transcriptome sequencing data. Using Trackster, Cufflinks was repeatedly run to identify the parameters that yielded that best transcript assembly, and dynamic filtering was used to identify criteria for removing assembly artifacts. Trackster provides a general framework for running tools and dynamically filtering data. Many other Galaxy tools can be run in Trackster, including genomic operations tools (intersect, subtract) and SNP finding tools. Dynamic filtering is available for many different data types, including BED, GFF/GTF, and BAM. Visualizations embedded in this page demonstrate how Trackster can be used for collaborative analysis. All shared, published, and embedded visualizations are fully functional. All Trackster functionality requires only a Web browser to use.

Materials

Datasets used in this study are shared in the data library 'Illumina BodyMap 2.0' and available for use. Use the Shared Data --> Data Libraries link above to find this library.

Here are the transcriptome analysis histories for each tissue: History 'BodyMap Adipose' , History 'BodyMap Adrenal' , History 'BodyMap Brain' , History 'BodyMap Breast' , History 'BodyMap Colon' , History 'BodyMap Heart' , History 'BodyMap Kidney' , History 'BodyMap Liver' , History 'BodyMap Lung' , History 'BodyMap Lymph Node' , History 'BodyMap Ovary' , History 'BodyMap Prostate' , History 'BodyMap Skeletal Muscle' , History 'BodyMap Testes' , History 'BodyMap Thyroid' , History 'BodyMap White Blood Cells'

Here is the analysis history for the translational pause point analysis: History 'XBP1 Translational Pause Point Analysis'

References

[1] Yoshida, H., Matsui, T., Yamamoto, A., Okada, T. & Mori, K. XBP1 mRNA is induced by ATF6 and spliced by IRE1 in response to ER stress to produce a highly active transcription factor. Cell 107, 881-891 (2001).

[2] Uemura, A., Oku, M., Mori, K. & Yoshida, H. Unconventional splicing of XBP1 mRNA occurs in the cytoplasm during the mammalian unfolded protein response. J Cell Sci 122, 2877-2886 (2009).

[3] Walter, P. & Ron, D. The unfolded protein response: from stress pathway to homeostatic regulation. Science 334, 1081-1086 (2011).

[4] Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105-1111 (2009).

[5] Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech 28, 511-515 (2010).

[6] Ron, D. & Ito, K. Cell biology. A translational pause to localize. Science 331, 543-544 (2011).

[7] Ingolia, N.T., Lareau, L.F. & Weissman, J.S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789-802 (2011).

[8] Yanagitani, K., Kimata, Y., Kadokura, H. & Kohno, K. Translational pausing ensures membrane targeting and cytoplasmic splicing of XBP1u mRNA. Science 331, 586-589 (2011).

Author Details

1Departments of Biology and Math & Computer Science, Emory University, USA

2http://galaxyproject.org; The Galaxy Team is Enis Afgan, Guru Ananda, Dannon Baker, Dan Blankenberg, Nate Coraor, Jeremy Goecks, Jennifer Jackson, Greg Von Kuster, Ross Lazarus, Anton Nekrutenko, and James Taylor

3Center for Comparative Genomics and Bioinformatics, Penn State University, USA