Manipulation of FASTQ data with Galaxy

This document is a live copy of the supplementary materials for Galaxy's FASTQ manipulation tools. It presents a set of screencasts and the results of vetting the toolset against published test files.

The proliferation of next-generation sequencing technologies has created numerous data management and analysis issues. The most troubling of these stems from the lack of standardized sequencer output and tools. Because even the de facto standard output, FASTQ (see Cock et al., 2009), comes in a number of format variants, preparing and quality checking sequencing data can be particularly troublesome. Galaxy contains a set of tools that handles all known FASTQ variants and is intended to simplify the first steps following data acquisition. These steps typically follow the workflow of 1) parsing sequencer output, 2) calculating and 3) visualizing summary statistics on quality scores and nucleotide distributions, 4) trimming reads if necessary, and 5) filtering reads by quality score and applying other manipulations. The FASTQ tools are found under the NGS: QC and manipulation tool section.
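
The parse, summarize, trim and filter steps listed above can be sketched in a few lines of Python. This is a minimal illustration only, not Galaxy's implementation: the function names (read_fastq, mean_quality, trim_3prime, trim_and_filter), the thresholds and the example reads are all hypothetical. It assumes Sanger-encoded input with exactly four lines per record, so line-wrapped FASTQ files are not handled.

```python
import io

SANGER_OFFSET = 33  # assumption: Sanger-encoded input (ASCII code minus 33 gives the Phred score)


def read_fastq(handle):
    """Yield (title, sequence, quality) from a strict four-line-per-record FASTQ handle."""
    while True:
        line = handle.readline()
        if not line:
            return  # end of input
        title = line.rstrip("\n")
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not title.startswith("@") or not plus.startswith("+") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record: %r" % title)
        yield title[1:], seq, qual


def mean_quality(qual):
    """Summary statistic: mean Phred score of one read (0.0 for an empty read)."""
    if not qual:
        return 0.0
    return sum(ord(c) - SANGER_OFFSET for c in qual) / len(qual)


def trim_3prime(seq, qual, min_score):
    """Drop bases from the 3' end while their score is below min_score."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - SANGER_OFFSET < min_score:
        end -= 1
    return seq[:end], qual[:end]


def trim_and_filter(handle, min_score=20, min_mean=30.0):
    """Parse, trim, then keep only non-empty reads whose mean quality is high enough."""
    for title, seq, qual in read_fastq(handle):
        seq, qual = trim_3prime(seq, qual, min_score)
        if seq and mean_quality(qual) >= min_mean:
            yield title, seq, qual


# Hypothetical three-read example: 'I' encodes Phred 40 and '#' encodes Phred 2.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\nII##\n@read3\nACGT\n+\n####\n")
kept = list(trim_and_filter(example))
# kept == [("read1", "ACGT", "IIII"), ("read2", "AC", "II")]
```

In this example read2 loses its two low-quality 3' bases and read3 is trimmed to nothing, so it is dropped by the filter.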

Screencasts

  1. Basic FASTQ manipulation: groomer, splitter and joiner, quality statistics and boxplot
  2. Advanced FASTQ manipulation: filtering, trimming, etc.

Typical Workflow of FASTQ Manipulation


[Figure: FASTQ Workflow]

Implementation

This toolset was implemented in Python and is distributed as part of the standard Galaxy distribution, available from http://getgalaxy.org. The specific sections of code are found in ${GALAXY_ROOT}/lib/galaxy_utils/sequence/ and ${GALAXY_ROOT}/tools/fastq/.

Test Results

The FASTQ Groomer tool, which shares a common parser with the entire toolset, was run against the test files provided by Cock et al. (Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. (2009) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res.), and the results are presented here. Along with their description of the FASTQ format, Cock et al. provide a set of test files to ensure that tools working with this format properly handle the intricacies of the different variants. The results in Galaxy largely match those expected by Cock et al., with one deviation: some of the test files originally marked as invalid can be groomed within Galaxy. In particular, certain out-of-range or invalid quality score values do not prevent successful grooming (see the Invalid FASTQ Files table).

The actions shown below as "convert to" were performed using the FASTQ Groomer tool within Galaxy; when converting, the input and output types are as indicated.

Valid FASTQ Files
File Name | Action | Cock et al. behavior | Galaxy behavior
longreads_original_sanger.fastq | convert to Sanger | longreads_as_sanger.fastq | longreads_as_sanger.fastq
longreads_original_sanger.fastq | convert to Illumina | longreads_as_illumina.fastq | longreads_as_illumina.fastq
longreads_original_sanger.fastq | convert to Solexa | longreads_as_solexa.fastq | longreads_as_solexa.fastq
wrapping_original_sanger.fastq | convert to Sanger | wrapping_as_sanger.fastq | wrapping_as_sanger.fastq
wrapping_original_sanger.fastq | convert to Illumina | wrapping_as_illumina.fastq | wrapping_as_illumina.fastq
wrapping_original_sanger.fastq | convert to Solexa | wrapping_as_solexa.fastq | wrapping_as_solexa.fastq
illumina_full_range_original_illumina.fastq | convert to Sanger | illumina_full_range_as_sanger.fastq | illumina_full_range_as_sanger.fastq
illumina_full_range_original_illumina.fastq | convert to Illumina | illumina_full_range_as_illumina.fastq | illumina_full_range_as_illumina.fastq
illumina_full_range_original_illumina.fastq | convert to Solexa | illumina_full_range_as_solexa.fastq | illumina_full_range_as_solexa.fastq
sanger_full_range_original_sanger.fastq | convert to Sanger | sanger_full_range_as_sanger.fastq | sanger_full_range_as_sanger.fastq
sanger_full_range_original_sanger.fastq | convert to Illumina | sanger_full_range_as_illumina.fastq | sanger_full_range_as_illumina.fastq
sanger_full_range_original_sanger.fastq | convert to Solexa | sanger_full_range_as_solexa.fastq | sanger_full_range_as_solexa.fastq
solexa_full_range_original_solexa.fastq | convert to Sanger | solexa_full_range_as_sanger.fastq | solexa_full_range_as_sanger.fastq
solexa_full_range_original_solexa.fastq | convert to Illumina | solexa_full_range_as_illumina.fastq | solexa_full_range_as_illumina.fastq
solexa_full_range_original_solexa.fastq | convert to Solexa | solexa_full_range_as_solexa.fastq | solexa_full_range_as_solexa.fastq
misc_dna_original_sanger.fastq | convert to Sanger | misc_dna_as_sanger.fastq | misc_dna_as_sanger.fastq
misc_dna_original_sanger.fastq | convert to Illumina | misc_dna_as_illumina.fastq | misc_dna_as_illumina.fastq
misc_dna_original_sanger.fastq | convert to Solexa | misc_dna_as_solexa.fastq | misc_dna_as_solexa.fastq
misc_rna_original_sanger.fastq | convert to Sanger | misc_rna_as_sanger.fastq | misc_rna_as_sanger.fastq
misc_rna_original_sanger.fastq | convert to Illumina | misc_rna_as_illumina.fastq | misc_rna_as_illumina.fastq
misc_rna_original_sanger.fastq | convert to Solexa | misc_rna_as_solexa.fastq | misc_rna_as_solexa.fastq
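
The conversions in the table above change how each quality score is encoded. The following sketch shows the idea using the encodings and the Phred/Solexa mapping described by Cock et al.: Sanger stores Phred scores 0 to 93 at ASCII offset 33, Illumina 1.3+ stores Phred scores 0 to 62 at offset 64, and Solexa stores Solexa scores -5 to 62 at offset 64. It is not the Groomer's code. The names convert_quality, phred_to_solexa and solexa_to_phred are hypothetical, and clamping scores that fall outside the target range is an assumption made here for simplicity.

```python
import math

# Per-variant ASCII offset and allowed score range, as described by Cock et al.
ASCII_OFFSET = {"sanger": 33, "illumina": 64, "solexa": 64}
SCORE_RANGE = {"sanger": (0, 93), "illumina": (0, 62), "solexa": (-5, 62)}


def phred_to_solexa(q):
    """Map a Phred score to the nearest Solexa score, never going below -5."""
    if q <= 0:
        return -5  # the formula is undefined at Phred 0
    return max(-5, int(round(10 * math.log10(10 ** (q / 10.0) - 1))))


def solexa_to_phred(q):
    """Map a Solexa score to the nearest Phred score."""
    return int(round(10 * math.log10(10 ** (q / 10.0) + 1)))


def convert_quality(qual, source, target):
    """Re-encode a quality string from one FASTQ variant to another."""
    low, high = SCORE_RANGE[target]
    out = []
    for char in qual:
        score = ord(char) - ASCII_OFFSET[source]
        if source == "solexa" and target != "solexa":
            score = solexa_to_phred(score)
        elif source != "solexa" and target == "solexa":
            score = phred_to_solexa(score)
        # Assumption: scores outside the target range are clamped, not rejected.
        score = min(max(score, low), high)
        out.append(chr(score + ASCII_OFFSET[target]))
    return "".join(out)
```

For example, convert_quality("I", "sanger", "illumina") gives "h", since both characters encode Phred 40. High scores agree almost exactly between the Phred and Solexa scales, so the nonlinear mapping matters mainly for low-quality bases.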

The actions shown below as "parse" were performed using the FASTQ Groomer tool within Galaxy; when parsing, the input and output types were both assumed to be Sanger.

Invalid FASTQ Files
File Name | Action | Cock et al. behavior | Galaxy behavior
error_diff_ids.fastq | parse | unable to parse | unable to parse
error_double_qual.fastq | parse | unable to parse | unable to parse
error_double_seq.fastq | parse | unable to parse | unable to parse
error_long_qual.fastq | parse | unable to parse | unable to parse
error_no_qual.fastq | parse | unable to parse | unable to parse
error_qual_del.fastq | parse | unable to parse | non-printing / out of range ASCII character is groomed to highest allowed quality score
error_qual_escape.fastq | parse | unable to parse | non-printing / out of range ASCII character is groomed to lowest allowed quality score
error_qual_null.fastq | parse | unable to parse | non-printing / out of range ASCII character is groomed to lowest allowed quality score
error_qual_space.fastq | parse | unable to parse | unable to parse
error_qual_tab.fastq | parse | unable to parse | unable to parse
error_qual_unit_sep.fastq | parse | unable to parse | non-printing / out of range ASCII character is groomed to lowest allowed quality score
error_qual_vtab.fastq | parse | unable to parse | non-printing / out of range ASCII character is groomed to lowest allowed quality score
error_short_qual.fastq | parse | unable to parse | unable to parse
error_spaces.fastq | parse | unable to parse | unable to parse
error_tabs.fastq | parse | unable to parse | tabs within sequence string are allowed to pass through groomed output and non-printing / out of range ASCII character is groomed to lowest allowed quality score
error_trunc_at_seq.fastq | parse | unable to parse | unable to parse
error_trunc_at_plus.fastq | parse | unable to parse | unable to parse
error_trunc_at_qual.fastq | parse | unable to parse | unable to parse
error_trunc_in_title.fastq | parse | unable to parse | unable to parse
error_trunc_in_seq.fastq | parse | unable to parse | unable to parse
error_trunc_in_plus.fastq | parse | unable to parse | unable to parse
error_trunc_in_qual.fastq | parse | unable to parse | unable to parse
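
The "groomed to lowest / highest allowed quality score" behavior reported in the table can be pictured as clamping each quality character to the valid Sanger range. The sketch below illustrates only that idea, under the assumption that grooming acts per character. The function name groom_sanger_quality is hypothetical, and the sketch does not reproduce the cases in which Galaxy is unable to parse a file, such as a space or tab in the quality string.

```python
SANGER_MIN_CHAR = 33   # '!' encodes Phred 0, the lowest Sanger score
SANGER_MAX_CHAR = 126  # '~' encodes Phred 93, the highest Sanger score


def groom_sanger_quality(qual):
    """Clamp every quality character to the valid Sanger ASCII range.

    Assumption: characters below the range become the lowest score and
    characters above it become the highest score.
    """
    groomed = []
    for char in qual:
        code = min(max(ord(char), SANGER_MIN_CHAR), SANGER_MAX_CHAR)
        groomed.append(chr(code))
    return "".join(groomed)
```

Under this assumption a DEL character (ASCII 127) becomes "~", the highest score, while NUL, escape, unit separator and vertical tab characters all become "!", the lowest score.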