Raisins and Rabbit Turds: NGS Quality Control Using Galaxy
Training Day, Galaxy Community Conference 2014, Johns Hopkins University, Baltimore
9-11:30 am, Monday June 30th, in Salon A Room 303
0. How to Use
Example histories contain test data suitable for exploring QA/QC methods, in these practicals or elsewhere. They are a close duplicate of the materials available during the workshop.
- Import the practical tutorial histories below, one at a time.
- Review the methods in the slides, or use the "re-run" icons in a history to learn how tools and QA/QC concepts were applied to the datasets.
- Practice extracting and editing workflows (more help here: http://wiki.galaxyproject.org/Learn/AdvancedWorkflow)
- Use the History menu "Copy Datasets" function to copy just the input datasets into a new history, then explore tools, methods, and workflows.
1. Introduction to QC tools
There are several "premade" tools that are useful for a quick look at data. They can be quite informative, but you need to know what each tool is doing and where the assumptions it makes may not fit your data.
- KZ_R1 is the forward read from a failed Illumina sequencer run. It was a custom amplicon run.
- It should be possible to identify the cycle at which the sequencer had problems.
- This data was never released, so the problems are obvious.
- Some of the flagged problems are expected for this type of data.
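To make the "which cycle went wrong" question concrete, here is a minimal Python sketch of the per-cycle statistic that premade QC tools typically plot. It is not the workshop's method. It assumes plain 4-line FASTQ records with Phred+33 qualities, and the function names, the toy reads, and the threshold of 20 are all illustrative choices.

```python
import io

def mean_quality_per_cycle(fastq_handle, offset=33):
    """Return the mean Phred score at each read position (sequencing cycle).

    Assumes 4-line FASTQ records and the given ASCII offset (an assumption).
    """
    totals = []  # sum of quality scores seen at each position
    counts = []  # number of reads that reach each position
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 3:  # the quality string is the 4th line of a record
            continue
        for pos, char in enumerate(line.rstrip("\n")):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]

def first_bad_cycle(means, threshold=20):
    """Return the 1-based cycle where mean quality first drops below threshold."""
    for cycle, mean in enumerate(means, start=1):
        if mean < threshold:
            return cycle
    return None

# Toy data (not KZ_R1): quality collapses at cycle 4 ('I' = Q40, '#' = Q2).
example = io.StringIO(
    "@read1\nACGTAC\n+\nIII###\n"
    "@read2\nACGTAC\n+\nIII###\n"
)
means = mean_quality_per_cycle(example)
print(means)                   # [40.0, 40.0, 40.0, 2.0, 2.0, 2.0]
print(first_bad_cycle(means))  # 4
```

On real data a single threshold is rarely enough. Look at the whole per-cycle curve, since amplicon runs can show patterns that are expected rather than faulty.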
There are also many creative ways to use tools that are not labeled 'QC' to do QC tasks. Many of these tools will be familiar to UNIX users. Even without that familiarity, the tools are very simple, and when combined they give you a powerful way to examine and test your data.
- I got the .bed file from some guy in a van, real cheap. He said it was all the genes from hg19, real recent, just as good as from the store.
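A file of uncertain origin deserves a few basic sanity checks before it is trusted. The sketch below shows the kind of test that simple column and filter tools can perform, written out in Python. The function name and example lines are hypothetical. Flagging start >= end is an assumption that suits gene intervals and is stricter than some BED definitions, which allow zero-length features.

```python
def check_bed_lines(lines):
    """Yield (line_number, problem) for each BED line that fails a basic check."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines, comments, and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            yield line_number, "fewer than 3 tab-separated columns"
            continue
        start, end = fields[1], fields[2]
        if not (start.isdigit() and end.isdigit()):
            yield line_number, "start or end is not a non-negative integer"
            continue
        if int(start) >= int(end):  # assumption: genes should have length > 0
            yield line_number, "start is not less than end"
        if len(fields) >= 6 and fields[5] not in ("+", "-", "."):
            yield line_number, "strand is not '+', '-' or '.'"

# Hypothetical lines illustrating one good record and three kinds of trouble.
bed_lines = [
    "chr1\t100\t200\tgeneA\t0\t+\n",
    "chr1\t300\t250\tgeneB\t0\t-\n",  # start after end
    "chr2 10 20\n",                   # spaces instead of tabs
    "chr3\t5\t50\tgeneC\t0\t?\n",     # invalid strand
]
for line_number, problem in check_bed_lines(bed_lines):
    print(line_number, problem)
```

Passing these checks only shows the file is well formed. Whether it really holds "all the genes from hg19" needs a comparison against a trusted annotation, for example on gene counts and chromosome names.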
2. Data: Good, Bad, and …
3. Decoding Public Data
Public archives contain data of variable content and quality. Before working with such data in Galaxy, make sure you know what you have and that Galaxy knows it, too.
- Sample: 75 sequences from the BodyMap2 study (Single-End Illumina)
- Protocol video walk-through: http://vimeo.com/galaxyproject/fastqprep
- Full Data source: accession "ERR030856" via tool "Get Data -> EBI-SRA"
- The second history is a BONUS: the full SRA dataset with expanded QA and mapping comparisons.
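One part of "knowing what you have" is the quality score encoding, which Galaxy records in the datatype (for example fastqsanger versus fastqillumina). The heuristic below is a rough sketch of how the ASCII range of the quality characters hints at the offset. The function name and cut-offs are illustrative assumptions, and this is not how Galaxy itself decides.

```python
def guess_quality_offset(quality_strings):
    """Guess the FASTQ quality offset from the ASCII range of quality characters.

    Heuristic only: a small sample can easily land in the ambiguous range.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        return "unknown"
    lowest, highest = min(codes), max(codes)
    if lowest < 59:   # characters below ';' cannot occur with a +64 offset
        return "phred33"
    if highest > 74:  # above 'J' (Q41 at offset 33) while nothing is low
        return "phred64"
    return "ambiguous"

print(guess_quality_offset(["IIII####"]))  # phred33
print(guess_quality_offset(["hhhhbbbb"]))  # phred64
print(guess_quality_offset(["@@AABB"]))    # ambiguous
```

If the guess is ambiguous, sample more reads or check the archive's metadata and the sequencing platform version before assigning a datatype.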
Using what we've learned, can you find the raisins?
- .fastq datasets
- .fasta reference genome
- .gff reference annotation
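When a reference genome and an annotation come from different sources, one common problem is that their sequence names do not match (for example "chr2" versus "2"). The sketch below is one possible check, not part of the workshop materials. The function names and the toy FASTA and GFF lines are hypothetical.

```python
def fasta_names(lines):
    """Collect sequence identifiers: the first word after '>' on header lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names

def gff_seqids(lines):
    """Collect the first column of GFF feature lines, skipping comments."""
    seqids = set()
    for line in lines:
        if line.startswith("##FASTA"):  # GFF3 may embed sequences after this
            break
        if line.startswith("#") or not line.strip():
            continue
        seqids.add(line.split("\t")[0])
    return seqids

# Toy inputs: the annotation uses "2" where the genome uses "chr2".
fasta = [">chr1 example description\n", "ACGT\n", ">chr2\n", "GGCC\n"]
gff = [
    "##gff-version 3\n",
    "chr1\t.\tgene\t1\t4\t.\t+\t.\tID=g1\n",
    "2\t.\tgene\t1\t4\t.\t+\t.\tID=g2\n",
]
missing = gff_seqids(gff) - fasta_names(fasta)
print(sorted(missing))  # ['2']
```

Any name printed here is an annotated sequence that the reference genome does not contain under that name, so features on it cannot be matched to the genome until the names are reconciled.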