

June 2012

Using Galaxy to Perform Large-Scale Interactive Data Analysis: A live supplement

Jennifer Hillman-Jackson,1 Dave Clements,2 Daniel Blankenberg,1 James Taylor,2 Anton Nekrutenko,1 and the Galaxy Team1,2

1Penn State University, University Park, Pennsylvania

2Emory University, Atlanta, Georgia

Correspondence should be addressed to Jennifer Hillman-Jackson, Dave Clements, or Daniel Blankenberg.

How to use this document

This document is an interactive supplement to "Using Galaxy to Perform Large-Scale Interactive Data Analysis, Unit 10.5" in Current Protocols in Bioinformatics. Every protocol, dataset, and workflow described in the paper is available from this page. These supplementary items at Galaxy can be examined, copied, rerun, and modified within the main public instance (here, at usegalaxy.org); migrated to a local or cloud Galaxy instance (getgalaxy.org); and/or downloaded, moved, copied, and loaded into the UCSC Genome Browser, IGV, Ensembl Browser, or another tool of interest. In brief, all tools and methods can be freely used in any way that you wish. All external datasets are public; please review each source reference for specific credits and usage requirements.

Citations should reference this publication, the core Galaxy publications listed in our wiki Citing Galaxy, and any third-party data or tools used, as appropriate.

For each Protocol, the following is provided: 

  • Input datasets
  • Complete history
  • Workflows (if any)
  • Screencast Video tutorial (most)

Basic Protocol 1: Finding Human Coding Exons with Highest SNP Density

This protocol demonstrates a data analysis method to answer a specific biological question. Using external data accessible to Galaxy and tools contained entirely within it, large datasets are compared and manipulated to complete the data reduction, and the results are then visualized. Extraction from the UCSC Table Browser, Interval Operations, Sorting, Text Manipulations, and a supplemental Galaxy Track Browser (Trackster) visualization are included.
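As a rough illustration of the kind of data reduction this protocol performs, the sketch below counts how many SNPs overlap each exon and ranks the exons by that count. It is not the Galaxy tool chain itself: the function name `count_snps_per_exon`, the tuple layout, and the tiny example records are all assumptions made up for illustration, and coordinates are assumed to be 0-based and half-open (BED-style).

```python
from collections import Counter


def count_snps_per_exon(exons, snps):
    """Count SNPs overlapping each exon and rank exons by that count.

    Both inputs are lists of (chrom, start, end, name) tuples with
    0-based, half-open coordinates. This layout is an assumption for
    the sketch, not a Galaxy data structure.
    """
    counts = Counter()
    for e_chrom, e_start, e_end, e_name in exons:
        for s_chrom, s_start, s_end, _s_name in snps:
            # Two intervals overlap when each starts before the other ends.
            if s_chrom == e_chrom and s_start < e_end and s_end > e_start:
                counts[e_name] += 1
    # most_common() returns (name, count) pairs, highest count first.
    return counts.most_common()


# Made-up example records (not taken from the protocol datasets).
exons = [("chr22", 100, 200, "exonA"), ("chr22", 300, 400, "exonB")]
snps = [
    ("chr22", 150, 151, "rs1"),
    ("chr22", 350, 351, "rs2"),
    ("chr22", 399, 400, "rs3"),
    ("chr22", 500, 501, "rs4"),
]
ranked = count_snps_per_exon(exons, snps)
print(ranked)
```

The nested loop is quadratic and only suitable for toy inputs; for chromosome-scale datasets the interval tools used in the protocol are the appropriate choice.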

Input Datasets

Both datasets are loaded at the start of the protocol.

Datasets from UCSC Table Browser

Source 'Exons hg19 chr22': RefSeq Genes. Note: RefSeq Genes is updated daily by UCSC and counts may differ from those in paper, shared datasets, histories, or screencast.

Source 'SNPs hg19 chr22': dbSNP132

History

A complete history for Basic Protocol 1, showing all input, intermediate, and output datasets, and a description of each step in the analysis.

Screencast Video Tutorial

"Using Galaxy: Finding Human Coding Exons with Highest SNP Density"

Protocol 1 step-by-step video tutorial that includes a supplemental Trackster walk-through for visualizing input and result datasets.

Basic Protocol 2: Loading Data and Understanding Datatypes

This protocol demonstrates multiple ways to import data into Galaxy. Details include how datasets are loaded, labeled, modified, and tracked by built-in and user-accessible methods for different datatypes within a Galaxy history. Metadata attributes are explained and manipulated. Examples of how to obtain data from a Shared Data Library, an FTP Upload, and the UCSC Table Browser are covered step-by-step.

Input Datasets

These three dataset groupings are loaded individually at the start of each section of the protocol. Input datasets are available below as Galaxy objects but are also included in the Galaxy wiki (this wiki link is the original input data resource within the publication).

Datasets from the ENCODE project:

These data are from the 'Transcription Factor Binding Sites by ChIP-seq from ENCODE/Stanford/Yale' mouse ChIP-SEQ experiment in the ENCODE project. Data were generated and analyzed by the labs of Michael Snyder at Stanford University and Sherman Weissman at Yale University.

Source 'Tags Chr19 ungroomed' and 'Control Chr19 ungroomed': Original files from ENCODE have been reduced to contain only data that corresponds to chromosome 19 and can be found at:
ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeSydhTfbs/
     wgEncodeSydhTfbsMelCtcfDmso20IggyaleRawDataRep2.fastq.gz - 'Tags'
     wgEncodeSydhTfbsMelInputDmso20IggyaleRawData.fastq.gz* - 'Control'
    (*note: this is a correction from the control source listed in the publication)

Dataset from the Mammalian Promoter Database (MPromDb):

MPromDb at the Wistar Institute is a "curated database that strives to annotate gene promoters identified from ChIP-Seq experiment results." It is a public resource, but requires a login to download data. Please note that these data are restricted to non-commercial use. We wish to thank the Davuluri Lab for allowing us to use these data.

Source "MPromDB Promoters chr19": This is a tab-delimited, custom-format file for non-commercial use only. It is a reduced version of the full file from MPromDb that contains only promoters on mm9, chromosome 19. We ask you to honor MPromDb's use restrictions.

Dataset from UCSC Table Browser:

Source 'RefSeq Genes chr19': mm9, chr19 RefSeq Genes track from UCSC Table Browser.

History

A complete history for Basic Protocol 2, showing all input, intermediate, and output datasets, and a description of each step in the analysis.

Screencast Video Tutorial

"Using Galaxy: Loading Data and Understanding Datatypes"

Protocol 2 step-by-step video tutorial.

Basic Protocol 3: Calling Peaks for ChIP-seq Data

This protocol demonstrates how to perform a peak calling analysis based on a publicly available input data source and the ChIP-seq analysis tool MACS. Supplemental datasets are available to correlate these peaks with RefSeq Genes sourced from UCSC and promoter regions sourced from MPromDb (see Protocol 2).

Input Datasets

The ENCODE datasets are loaded at the start of the protocol. The MPromDb and RefSeq datasets are provided as supplemental data to be loaded as needed. Both dataset groups can be obtained by following the methods in Basic Protocol 2, from a completed Protocol 2 history, or from this Supplemental Page as Galaxy Objects (below or grouped with Protocol 2 Objects).

Datasets from the ENCODE project:

Source 'Tags Chr19 ungroomed' and 'Control Chr19 ungroomed': These data are from the 'Transcription Factor Binding Sites by ChIP-seq from ENCODE/Stanford/Yale' mouse ChIP-SEQ experiment in the ENCODE project. (See Protocol 2 for source detail)

History

A complete history for Basic Protocol 3, showing all input, intermediate, and output datasets, and a description of each step in the analysis.

Supplemental protocol step for CASAVA 1.8+ FASTQ format input datasets

CASAVA 1.8+ FASTQ format data contain whitespace within the sequence identifiers.

Example (source: Wikipedia): @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
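To check whether your own reads use this identifier style, a minimal sketch such as the following can be applied to the title line of a FASTQ record. The function name `has_whitespace_in_identifier` is hypothetical, and the check only looks for whitespace after the leading '@'; it does not validate any other aspect of the CASAVA format.

```python
def has_whitespace_in_identifier(title_line):
    """Return True if a FASTQ title line contains whitespace after the '@'.

    Hypothetical helper for illustration: it detects only the whitespace
    described above, nothing else about the record.
    """
    if not title_line.startswith("@"):
        raise ValueError("not a FASTQ title line: %r" % title_line)
    # Drop the '@' and any trailing newline before looking for whitespace.
    title = title_line[1:].rstrip("\r\n")
    return any(ch.isspace() for ch in title)


# The example title line shown above.
casava_title = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
print(has_whitespace_in_identifier(casava_title))

# Splitting on the first run of whitespace separates the two parts.
name_part, extra_part = casava_title[1:].split(None, 1)
print(name_part)
print(extra_part)
```

Checking the first record of a file is usually enough to decide whether the supplemental step applies, since a given file normally uses one identifier style throughout.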

If you are using data of this format with a workflow based on this Protocol (Protocol 3 example datasets are not of this format), the SAM format output datasets from Bowtie at the end of Step 4 will need to be converted to BAM format datasets before running MACS in Step 5. To do this, modify the protocol outlined in the publication by adding a data transformation Step 4.h. after the existing operations in Step 4 and before starting Step 5. Then replace the input datasets in Step 5.c. as follows:

     Add in a new protocol Step 4.h.

     4.h. Convert SAM output to BAM format
        i.   Click on 'NGS: SAM Tools' to expand the tool list.
        ii.  Click on 'SAM-to-BAM'
             Set:  'Choose the source for the reference list:' to 'Locally cached'
             Set:  'SAM File to Convert:' to 'Tags Chr19 SAM'
             Click Execute
        iii. Click on the pencil icon for the output dataset, rename it as 'Tags Chr19 BAM', and save.
        iv.  Repeat for the 'Control Chr19 SAM' dataset, naming the output dataset 'Control Chr19 BAM'

     Replace in protocol Step 5.c.

        i.  Replace "Tags Chr19 SAM" with "Tags Chr19 BAM"
        ii. Replace "Control Chr19 SAM" with "Control Chr19 BAM"

Screencast Video Tutorial

"Using Galaxy: Calling Peaks for ChIP-seq Data"

Protocol 3 step-by-step video tutorial.

Basic Protocol 4: Compare Datasets Using Genomic Coordinates

This protocol explores Galaxy's range of genomic interval operations. The coordinates of an 'Exons' dataset are evaluated against the coordinates of a 'Repeats' dataset by various tools and the results of these methods are compared and contrasted. Potential biological data analysis usage and interpretations are discussed.
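To make the idea of an interval operation concrete, the sketch below implements two of the simplest cases, intersect and subtract, for a single pair of intervals on the same chromosome. It is a toy model under stated assumptions: the function names are hypothetical, intervals are (start, end) tuples with 0-based, half-open coordinates, and none of the options offered by the Galaxy tools are modeled.

```python
def intersect(a, b):
    """Return the overlapping part of intervals a and b, or None.

    a and b are (start, end) tuples, 0-based and half-open
    (an assumption for this sketch).
    """
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def subtract(a, b):
    """Return the pieces of interval a that are not covered by b."""
    pieces = []
    if b[0] > a[0]:
        # Part of a lying to the left of b.
        pieces.append((a[0], min(a[1], b[0])))
    if b[1] < a[1]:
        # Part of a lying to the right of b.
        pieces.append((max(a[0], b[1]), a[1]))
    return pieces


# Made-up coordinates standing in for one exon and one repeat.
exon = (100, 200)
repeat = (150, 250)
print(intersect(exon, repeat))
print(subtract(exon, repeat))
```

Comparing the two outputs for the same pair shows why the choice of operation matters: intersect keeps the exon sequence that lies inside the repeat, while subtract keeps the exon sequence that lies outside it.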

Input Datasets

Both input datasets are loaded at the start of the protocol. 'Exons hg19 chr22' can be obtained by following the methods in Basic Protocol 1, from a completed Protocol 1 history, or from this Supplemental Page as a Galaxy Object (below). 'Repeats' is sourced from the UCSC Table Browser (hg19, chromosome 22) and can be obtained following the methods in this protocol or from this Supplemental Page as a Galaxy Object (below).

Dataset from UCSC Table Browser:

Source 'SNP Coding Exons chr22': dbSNP 132 (see Protocol 1 for source detail)
Source 'Repeats': RepeatMasker (hg19 chr22)

History

A complete history for Basic Protocol 4, showing all input, intermediate, and output datasets, and a description of each step in the analysis.

Basic Protocol 5: Working with Multiple Sequence Alignments

This protocol demonstrates several methods for obtaining, manipulating, and filtering data in Multiple Sequence Alignment format (MAFs). How to move a dataset between Histories and the procedure to use a Workflow are included.

Input Datasets

One input dataset can be obtained by following the methods in Basic Protocol 1, from a completed Protocol 1 history, or from this Supplemental Page as Galaxy Objects (below). The other input datasets are created during the execution of the protocol, including novel MAF extraction methods.

Datasets from UCSC Table Browser:

Source 'SNP Coding Exons chr22': dbSNP 132 (see Protocol 1 for source detail)

History

A complete history for Basic Protocol 5, showing all input, intermediate, and output datasets, and a description of each step in the analysis.

Correction

Please note that there is a typographical error in Part C: Step 8.d.iii of the manuscript.

The pattern to enter for the Select tool in this filtering step should read: ^rheMac2\.  (no quotes)

Figure 10.5.36 in the manuscript has this pattern correct, as does the embedded history (above) and the video tutorial (below).
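The effect of the corrected pattern can be checked outside Galaxy with a short sketch. It assumes that, for a pattern this simple, the Select tool's matching behaves like Python's `re` module; the example lines are made up and are not taken from the protocol datasets.

```python
import re

# The corrected pattern: the line must start with "rheMac2" followed by
# a literal dot.
corrected = re.compile(r"^rheMac2\.")

# Made-up example lines for illustration.
lines = [
    "rheMac2.chr10",   # starts with "rheMac2." and matches
    "hg19.chr22",      # different prefix, no match
    "rheMac20.chr1",   # "rheMac2" is followed by "0", not a dot
    "x.rheMac2.chr1",  # "rheMac2." is not at the start of the line
]
selected = [line for line in lines if corrected.search(line)]
print(selected)

# For comparison, an unescaped dot matches any single character, so this
# looser pattern would also accept "rheMac20.chr1".
loose = re.compile(r"^rheMac2.")
loosely_selected = [line for line in lines if loose.search(line)]
print(loosely_selected)
```

The `^` anchor and the escaped dot together restrict the selection to lines that begin with exactly the `rheMac2.` prefix.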

Workflow

This protocol creates a workflow to transform a file of sequence blocks into a standardized FASTA file. This workflow is available below as a Galaxy object and also as a Published Workflow.

Screencast Video Tutorial

"Using Galaxy: Working with Multiple Sequence Alignments"

Protocol 5 step-by-step video tutorial.
