Raisins and Rabbit Turds: NGS Quality Control Using Galaxy

Training Day, Galaxy Community Conference 2014, Johns Hopkins University, Baltimore

9-11:30 am, Monday June 30th, in Salon A Room 303

Event description and Slides

*
0. How to Use

The example histories contain test data suitable for exploring QA/QC methods, in these practicals or elsewhere. They closely duplicate the materials available during the workshop.
Notes:

  1. Import the practical tutorial histories below, one at a time. 
  2. Review the methods in the slides, or use the "re-run" icons in each history, to learn how the tools and QA/QC concepts were applied to the datasets
  3. Practice extracting and editing workflows (more help here: http://wiki.galaxyproject.org/Learn/AdvancedWorkflow)
  4. Use the History menu "Copy Datasets" function to move just the input datasets into a new history, then explore tools/methods/workflows

*
1. Introduction to QC tools

There are several "premade" tools that are useful for a quick look at data. This can be quite informative, but you need to know what the tool is doing and how the assumptions it makes may differ from your data.

Notes:

  1. KZ_R1 is the forward read from a failed Illumina sequencer run. It was a custom amplicon run.
  2. It should be possible to identify the cycle at which the sequencer had problems
  3. This was not released data, so the problems are obvious
  4. Some of the flagged problems are expected for this type of data

Galaxy History | Using FastQC

A set of data to run through import validation. This is a 'bad' FASTQ file from a failed sequencer run.
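The idea of locating the problem cycle can be sketched outside Galaxy as well. The following is a minimal illustration, not what FastQC itself does: it assumes plain four-line FASTQ records (no wrapped sequences) and a Phred+33 offset, and the function names and the threshold of 20 are hypothetical choices made for this example.

```python
def per_cycle_mean_quality(fastq_lines, offset=33):
    """Return the mean Phred score for each sequencing cycle (read position).

    Assumes four-line FASTQ records; `offset` is the ASCII offset (assumed 33).
    """
    totals, counts = [], []
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        qual = line.rstrip("\n")
        for cycle, ch in enumerate(qual):
            if cycle == len(totals):  # first read long enough to reach this cycle
                totals.append(0)
                counts.append(0)
            totals[cycle] += ord(ch) - offset
            counts[cycle] += 1
    return [t / c for t, c in zip(totals, counts)]


def first_bad_cycle(means, threshold=20):
    """Return the 1-based cycle whose mean first drops below `threshold`, or None."""
    for idx, m in enumerate(means, start=1):
        if m < threshold:
            return idx
    return None


if __name__ == "__main__":
    demo = ["@r1", "ACGT", "+", "II#I", "@r2", "ACGT", "+", "II#I"]
    means = per_cycle_mean_quality(demo)
    print(means, first_bad_cycle(means))
```

On a real file, pass an open file handle instead of the demo list; a sharp drop at one cycle that persists in later cycles is the kind of pattern to look for in a failed run.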

There are also many creative ways to use tools that are not labeled 'QC' to do QC tasks. Many of these tools will be familiar to UNIX users. Even without that familiarity, the tools are very simple, and when combined they give you a powerful way to examine and test your data.

Notes:

  1. I got the .bed file from some guy in a van, real cheap. He said it was all the genes from hg19, real recent, just as good as from the store.

Galaxy History | Unix Tools for QC

Demo of unix tools to examine a file
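As a rough illustration of the kind of checks that simple cut/sort/count style tools make possible on a file of unknown origin, here is a small sketch in Python. It is not the set of tools used in the history; the function name `check_bed` and the problem labels are hypothetical, and it only assumes the first three BED columns are chromosome, start, and end.

```python
from collections import Counter


def check_bed(lines):
    """Tally simple problems in BED-like, tab-separated lines."""
    problems = Counter()
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        # skip blanks, comments, and track/browser header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            problems["too_few_columns"] += 1
            continue
        chrom, start, end = fields[:3]
        if not (start.isdigit() and end.isdigit()):
            problems["non_integer_coordinate"] += 1
            continue
        if int(start) >= int(end):
            problems["start_not_before_end"] += 1
        key = (chrom, start, end)
        if key in seen:
            problems["duplicate_interval"] += 1
        seen.add(key)
    return problems


if __name__ == "__main__":
    demo = [
        "chr1\t10\t20\tgeneA",
        "chr1\t10\t20\tgeneA",
        "chr2\t50\t40",
        "chrX\tfoo\t9",
        "chr3\t5",
    ]
    print(check_bed(demo))
```

Counting distinct chromosome names and comparing the total number of lines against the expected gene count are equally cheap follow-up checks.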

*
2. Data: Good, Bad, and …

Galaxy History | RNA-seq Example

Start of a history for RNA-seq QC data. Should show different characteristics than genomic data.

*
3. Decoding Public Data

Public Archives contain data of variable content and quality. Before working with it in Galaxy, make sure you know what you have and that Galaxy knows, too.

Notes:

  1. Sample: 75 sequences from the BodyMap2 study (Single-End Illumina)
  2. Protocol video walk-through: http://vimeo.com/galaxyproject/fastqprep
  3. Full Data source: accession "ERR030856" via tool "Get Data -> EBI-SRA" 
  4. Second history is BONUS, the full SRA dataset w/ expanded QA and mapping comparisons

Galaxy History | FASTQ datatype QA

Source: SRA, BodyMap2, 75 seqs. Type: Single-End Illumina. Search in "Get Data -> EBI-SRA" by the study accession. Vimeo "Fastq Prep - Illumina": http://vimeo.com/galaxyproject/fastqprep
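Knowing what you have includes knowing the quality score scale. The sketch below shows one common heuristic for guessing the ASCII offset from the characters observed in the quality lines. It is an illustration only, not Galaxy's own datatype detection: the function name and the cut-off values (59 and 74) are assumptions based on the usual ranges of Phred+33 and the older +64 encodings, and a small sample can easily be ambiguous.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from observed quality characters.

    Returns 33, 64, or None when the observed range fits both scales.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        return None
    lo, hi = min(codes), max(codes)
    if lo < 59:
        return 33  # characters this low do not occur in the +64 encodings
    if hi > 74:
        return 64  # characters this high are unusual for Phred+33 data
    return None    # ambiguous: inspect more reads or check the source metadata


if __name__ == "__main__":
    print(guess_phred_offset(["IIII#"]))       # low characters present
    print(guess_phred_offset(["hhhhB"]))       # high characters present
    print(guess_phred_offset(["@ABCDEFGHI"]))  # fits both scales
```

In Galaxy terms, Phred+33 data corresponds to the `fastqsanger` datatype and Phred+64 data to `fastqillumina`, so a wrong guess here carries through to every downstream tool that reads the scores.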

Galaxy History | FASTQ datatype QA- Full ERR030856

FastQC Reports align with data, but wouldn't help with all QC... Original v1.5 adaptors with 3' linker (as noted in SRA report ONLY! - Read Methods!!): ATCTCGTATGCCGTCTTCTGCTTG

  1. Mapped with correct & incorrect quality score scale/datatype.
  2. Mapped with linker unclipped and clipped (correct quality scaling only). Which made a difference?
  3. Can the unidentified contam be characterized? Does it matter?
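To make the clipping step concrete, here is a minimal sketch of trimming the 3' linker quoted above from a read. It is not the clipping tool used in the history: it matches the linker exactly (no mismatches allowed), and the function name and the `min_overlap` parameter are hypothetical choices for this example.

```python
LINKER = "ATCTCGTATGCCGTCTTCTGCTTG"  # 3' linker sequence quoted in the history notes


def clip_linker(seq, qual, linker=LINKER, min_overlap=6):
    """Trim the linker from the 3' end of a read; returns (seq, qual).

    Exact matching only. A partial linker at the very end of the read is
    clipped when at least `min_overlap` bases of the linker prefix match.
    """
    # Case 1: the full linker occurs somewhere in the read.
    pos = seq.find(linker)
    if pos != -1:
        return seq[:pos], qual[:pos]
    # Case 2: the read ends partway through the linker.
    max_k = min(len(linker) - 1, len(seq))
    for k in range(max_k, min_overlap - 1, -1):
        if seq.endswith(linker[:k]):
            return seq[:-k], qual[:-k]
    return seq, qual


if __name__ == "__main__":
    read = "ACGTACGT" + LINKER + "AAAA"
    print(clip_linker(read, "I" * len(read)))
    print(clip_linker("ACGTACGTATCTCGTA", "I" * 16))
```

Keeping the sequence and quality strings the same length after clipping matters, because a mismatch between the two makes the FASTQ record invalid.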

*
4. Challenge

Using what we've learned, can you find the raisins? 

Givens: 

  1. .fastq datasets
  2. .fasta reference genome
  3. .gff reference annotation

Galaxy History | Challenge: ChIP-Seq Custom Genome

  * custom alignment
  * custom build (Trackster vis w/ gff)
  * align yeast as-is
  * align yeast w/ mods
  * unmapped align human test to identify contam
  * megablast wgs/taxonomy
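When comparing alignments against different references, a quick tally of mapped versus unmapped reads is a useful first number. The sketch below counts them from SAM text using the standard flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). It is an illustration under the assumption that the mapper's output is available as uncompressed SAM; the function name is hypothetical and this is not a tool from the history.

```python
def count_mapped(sam_lines):
    """Return (mapped, unmapped) primary-record counts from SAM text lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:      # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:  # skip secondary/supplementary records
            continue
        if flag & 0x4:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


if __name__ == "__main__":
    demo = [
        "@HD\tVN:1.0",
        "r1\t0\tchrI\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r3\t256\tchrI\t200\t0\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    print(count_mapped(demo))
```

A large unmapped fraction against the expected genome, combined with a high mapped fraction of those same reads against another genome, is one way a contaminant shows up.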

*