Galaxy Data Formats

Dataset missing?

If you have a dataset in your history that is not appearing in the drop-down selector for a tool, the most common reason is that it has the wrong format. Each Galaxy dataset has an associated file format recorded in its metadata, and a tool will only list datasets from your history whose format is compatible with that tool. Of course, some of the listed datasets might not actually contain relevant data, or even the columns the tool needs, but filtering by format at least shortens the list to select from.

Some of the formats are defined hierarchically, going from very general ones like Tabular (which includes any text file with tab-separated columns), to more restrictive sub-formats like Interval (where three of the columns must be the chromosome, start position, and end position), and on to even more specific ones such as BED that have additional requirements. So for example if a tool's required input format is Tabular, then all of your history items whose format is recorded as Tabular will be listed, along with those in all sub-formats that also qualify as Tabular (Interval, BED, GFF, etc.).
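
The hierarchy described above can be sketched as a simple parent lookup. This is only an illustration: the PARENT table and the qualifies_as function are hypothetical names covering a few formats from this page, not Galaxy's actual datatype registry.

```python
# Hypothetical parent map for a few formats named on this page.
# Galaxy's real datatype registry is more extensive.
PARENT = {
    "tabular": None,
    "interval": "tabular",
    "bed": "interval",
    "bedgraph": "bed",
    "gff": "tabular",
}


def qualifies_as(fmt, required):
    """Return True if fmt is the required format or one of its sub-formats."""
    while fmt is not None:
        if fmt == required:
            return True
        fmt = PARENT.get(fmt)  # unknown formats have no parent
    return False


# A made-up history: dataset name -> recorded format.
history = {
    "reads.bam": "bam",
    "genes.bed": "bed",
    "features.gff": "gff",
    "table.tsv": "tabular",
}

# A tool that requires Tabular lists Tabular datasets and all sub-formats.
print([name for name, fmt in history.items() if qualifies_as(fmt, "tabular")])
# A tool that requires Interval lists fewer datasets.
print([name for name, fmt in history.items() if qualifies_as(fmt, "interval")])
```

Walking up from the dataset's format, rather than down from the required one, keeps the check to a few lookups per dataset.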

There are two usual methods for changing a dataset's format in Galaxy. If the file contents are already in the required format but the metadata is wrong (perhaps because the Auto-detect feature of the Upload File tool guessed it incorrectly), you can fix the metadata manually by clicking on the pencil icon beside that dataset in your history. If the file contents really are in a different format, Galaxy provides a number of format conversion tools (e.g. in the Text Manipulation and Convert Formats categories). For instance, if the tool you want to run requires Tabular but your columns are delimited by spaces or commas, you can use the "Convert delimiters to TAB" tool under Text Manipulation to reformat your data. However, if your files are in a completely unsupported format, you need to convert them yourself before uploading.

Format Descriptions

AB1
AXT
BAM
BED
BedGraph
Binseq.zip
FASTA
FastqSolexa
FPED
gd_indivs
gd_ped
gd_sap
gd_snp
GFF
GFF3
GTF
HTML
Interval
LAV
LPED
MAF
MasterVar
PBED
pgSnp
PSL
SCF
SFF
Table
Tabular
Txtseq.zip
VCF
Wiggle custom track
Other text type

AB1

This is one of the ABIF family of binary sequence formats from Applied Biosystems Inc. Files should have a '.ab1' file extension. You must manually select this file format when uploading the file.

AXT

Used for pairwise alignment output from BLASTZ, after post-processing. Each alignment block contains three lines: a summary line and two sequence lines. Blocks are separated from one another by blank lines. The summary line contains chromosomal position and size information about the alignment, and consists of nine required fields. More information

BAM

A binary alignment file compressed in the BGZF format with a '.bam' file extension. SAM is the human-readable text version of this format.

Can be converted to:

SAM
NGS: SAM Tools → BAM-to-SAM
Pileup
NGS: SAM Tools → Generate pileup
Interval
First convert to Pileup as above, then use NGS: SAM Tools → Pileup-to-Interval

BED

also qualifies as Tabular
also qualifies as Interval

This tab-separated format describes a genomic interval, but has strict field specifications for use in genome browsers. BED files can have from 3 to 12 columns, but the order of the columns matters, and only the end ones can be omitted. Some groups of columns must be all present or all absent. As in Interval format (but unlike GFF and its relatives), the interval endpoints use a 0-based, half-open numbering system. Field specifications

Example:

chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601

Can be converted to:

GFF
Convert Formats → BED-to-GFF
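
Because BED uses 0-based, half-open coordinates while GFF uses 1-based, inclusive ones, any conversion between them has to shift the start position. The following is a minimal sketch of that arithmetic only; the function names are hypothetical and this is not the code of the BED-to-GFF tool.

```python
def bed_to_gff_coords(bed_start, bed_end):
    """0-based half-open (BED) -> 1-based inclusive (GFF)."""
    # The first base moves from index 0 to index 1; the end is unchanged
    # because an exclusive 0-based end equals an inclusive 1-based end.
    return bed_start + 1, bed_end


def gff_to_bed_coords(gff_start, gff_end):
    """1-based inclusive (GFF) -> 0-based half-open (BED)."""
    return gff_start - 1, gff_end


# First interval of the BED example above: chr22 1000 5000
gff_start, gff_end = bed_to_gff_coords(1000, 5000)
print(gff_start, gff_end)  # 1001 5000

# The feature length is the same in both systems.
print(5000 - 1000, gff_end - gff_start + 1)  # 4000 4000
```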

BedGraph

also qualifies as Tabular
also qualifies as Interval
also qualifies as BED

BedGraph is a BED file in which the name column holds a floating-point value that is displayed as a wiggle score in tracks. Unlike in Wiggle format, the exact value of this score can be retrieved after the file has been loaded as a track.

Binseq.zip

A zipped archive consisting of binary sequence files in either AB1 or SCF format. All files in this archive must have the same file extension which is one of '.ab1' or '.scf'. You must manually select this file format when uploading the file.

FASTA

A sequence in FASTA format consists of a single-line description, followed by lines of sequence data. The first character of the description line is a greater-than ('>') symbol. All lines should be shorter than 80 characters.

>sequence1
atgcgtttgcgtgc
gtcggtttcgttgc
>sequence2
tttcgtgcgtatag
tggcgcggtga

Can be converted to:

Tabular
Convert Formats → FASTA-to-Tabular
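
To illustrate what a FASTA-to-Tabular conversion involves, here is a minimal sketch that joins each record's sequence lines and emits one tab-separated row per record. The function name fasta_to_rows is hypothetical, and the two-column output (title, sequence) is an assumption about the layout rather than a description of the Galaxy tool.

```python
def fasta_to_rows(text):
    """Return a list of (title, sequence) tuples, one per FASTA record."""
    rows = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if title is not None:
                rows.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        else:
            chunks.append(line)
    if title is not None:
        rows.append((title, "".join(chunks)))
    return rows


example = """>sequence1
atgcgtttgcgtgc
gtcggtttcgttgc
>sequence2
tttcgtgcgtatag
tggcgcggtga
"""

for title, seq in fasta_to_rows(example):
    print(title + "\t" + seq)
```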

FastqSolexa

FastqSolexa is the Illumina (Solexa) variant of the FASTQ format, which stores sequences and quality scores in a single file.

@seq1
GACAGCTTGGTTTTTAGTGAGTTGTTCCTTTCTTT
+seq1
hhhhhhhhhhhhhhhhhhhhhhhhhhPW@hhhhhh
@seq2
GCAATGACGGCAGCAATAAACTCAACAGGTGCTGG
+seq2
hhhhhhhhhhhhhhYhhahhhhWhAhFhSIJGChO

@seq1
GAATTGATCAGGACATAGGACAACTGTAGGCACCAT
+seq1
40 40 40 40 35 40 40 40 25 40 40 26 40 9 33 11 40 35 17 40 40 33 40 7 9 15 3 22 15 30 11 17 9 4 9 4
@seq2
GAGTTCTCGTCGCCTGTAGGCACCATCAATCGTATG
+seq2
40 15 40 17 6 36 40 40 40 25 40 9 35 33 40 14 14 18 15 17 19 28 31 4 24 18 27 14 15 18 2 8 12 8 11 9
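
The two examples above show quality scores written as ASCII characters and as space-separated integers. As a rough sketch of how the two relate, the code below decodes a character string into integers. It assumes the conventional Solexa/early Illumina encoding, in which each score is the character's ASCII code minus 64; the offset and the function names are assumptions, not something stated on this page.

```python
SOLEXA_OFFSET = 64  # assumed ASCII offset for Solexa-style qualities


def decode_ascii_quality(quality):
    """Convert a quality string such as 'hhPW@' to a list of integers."""
    return [ord(ch) - SOLEXA_OFFSET for ch in quality]


def parse_numeric_quality(quality):
    """Convert a space-separated quality line to a list of integers."""
    return [int(word) for word in quality.split()]


print(decode_ascii_quality("hhPW@"))       # [40, 40, 16, 23, 0]
print(parse_numeric_quality("40 40 35 9"))  # [40, 40, 35, 9]
```

In either representation there should be exactly one score per base in the sequence line.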

Can be converted to:

FASTA
NGS: QC and manipulation → Generic FASTQ manipulation → FASTQ to FASTA
Tabular
NGS: QC and manipulation → Generic FASTQ manipulation → FASTQ to Tabular

FPED

Also known as the FBAT format, for use with the FBAT program. It consists of a pedigree file and a phenotype file.

gd_indivs

This format is a tabular file. The first column is the column number (1-based) in the gd_snp file where the individual/group starts. The second column is the label from the metadata for the individual/group. The third is an alias or blank.

gd_sap

This is a tabular file describing single amino-acid polymorphisms (SAPs). You must manually select this file format when uploading the file.

gd_snp

This is a tabular file describing SNPs in individuals or populations. It contains the zero-based position of each SNP but not the range required by BED or Interval, so it cannot be used in Genomic Operations without adding a column for the end position. You must manually select this file format when uploading the file. Field specifications

GFF

also qualifies as Tabular

GFF is a tab-separated format somewhat similar to BED, but it has different columns and is more flexible. There are nine required fields. Note that unlike Interval and BED, GFF and its relatives (GFF3, GTF) use 1-based inclusive coordinates to specify genomic intervals.

Can be converted to:

BED
Convert Formats → GFF-to-BED

GFF3

also qualifies as Tabular

The GFF3 format addresses the most common extensions to GFF, while attempting to preserve compatibility with previous formats. Note that unlike Interval and BED, GFF and its relatives (GFF3, GTF) use 1-based inclusive coordinates to specify genomic intervals.

GTF

also qualifies as Tabular

GTF is a format for describing genes and other features associated with DNA, RNA, and protein sequences. It is a refinement to GFF that tightens the specification. Note that unlike Interval and BED, GFF and its relatives (GFF3, GTF) use 1-based inclusive coordinates to specify genomic intervals.

HTML

This format is an HTML web page. Click the eye icon next to the dataset to view it in your browser.

Interval

also qualifies as Tabular

This Galaxy format represents genomic intervals. It is tab-separated, but has the added requirement that three of the columns must be the chromosome name, start position, and end position, where the positions use a 0-based, half-open numbering system (see below). An optional strand column can also be specified, and an initial header row can be used to label the columns, which do not have to be in any special order. Arbitrary additional columns can also be present.

Required fields:

CHROM - The name of the chromosome (e.g. chr3, chrY, chr2_random) or contig (e.g. ctgY1).
START - The starting position of the feature in the chromosome or contig. The first base in a chromosome is numbered 0.
END - The ending position of the feature in the chromosome or contig. This base is not included in the feature. For example, the first 100 bases of a chromosome are described as START=0, END=100, and span the bases numbered 0-99.

Optional:

STRAND - Defines the strand, either '+' or '-'.
Header row

Example:

    #CHROM  START  END    STRAND  NAME  COMMENT
    chr1    10     100    +       exon  myExon
    chrX    1000   10050  -       gene  myGene

Can be converted to:

BED
The exact changes needed and tools to run will vary with what fields are in the Interval file and what type of BED you are converting to. In general you will likely use Text Manipulation → Compute, Cut, or Merge Columns.

LAV

LAV is the raw pairwise alignment format that is output by BLASTZ. The first line begins with #:lav.

LPED

This is the linkage pedigree format, which consists of separate MAP and PED files. Together these files describe SNPs; the map file contains the position and an identifier for the SNP, while the pedigree file has the alleles. To upload this format into Galaxy, do not use Auto-detect for the file format; instead select lped. You will then be given two sections for uploading files, one for the pedigree file and one for the map file. For more information, see linkage pedigree, MAP, and/or PED.

Can be converted to:

PBED
Automatic
FPED
Automatic

MAF

MAF is the multi-sequence alignment format that is output by TBA and Multiz. The first line begins with '##maf'. This word is followed by whitespace-separated "variable=value" pairs. There should be no whitespace surrounding the '='.
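
As a small illustration of the header rule, the following sketch splits a MAF first line into its "variable=value" pairs. The function name and the sample header line are made up for this example.

```python
def parse_maf_header(line):
    """Return the variable=value pairs from a '##maf' header line as a dict."""
    words = line.split()
    if not words or words[0] != "##maf":
        raise ValueError("not a MAF header line")
    pairs = {}
    for word in words[1:]:
        # Whitespace around '=' would split a pair into separate words,
        # which is why the format forbids it.
        if "=" not in word:
            raise ValueError("expected variable=value, got %r" % word)
        name, value = word.split("=", 1)
        pairs[name] = value
    return pairs


# Illustrative header line only.
print(parse_maf_header("##maf version=1 scoring=tba.v8"))
```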

Can be converted to:

BED
Convert Formats → MAF to BED
Interval
Convert Formats → MAF to Interval
FASTA
Convert Formats → MAF to FASTA

MasterVar

MasterVar is a tab-delimited text format with specified fields, developed by the life sciences company Complete Genomics. Field specifications.

Can be converted to:

pgSnp
Convert Formats → MasterVar to pgSnp
gd_snp
Convert Formats → MasterVar to gd_snp

PBED

This is the binary version of the LPED format.

Can be converted to:

LPED
Automatic

pgSnp

This is the personal genome SNP format used by UCSC. It is a BED-like format with columns chosen for the specialized display of personal genomes in the browser. Galaxy treats it the same as an Interval file. Field specifications.

PSL

PSL format is used for alignments returned by BLAT. It does not include any sequence.

SCF

This is a binary sequence format originally designed for the Staden sequence handling software package. Files should have a '.scf' file extension. You must manually select this file format when uploading the file. More information

SFF

This is a binary sequence format used by the Roche 454 GS FLX sequencing machine, and is documented on p. 528 of their software manual. Files should have a '.sff' file extension.

Can be converted to:

FASTA
Convert Formats → SFF converter
FASTQ
Convert Formats → SFF converter

Table

Text data separated into columns by something other than tabs.

Tabular (tab-delimited)

One or more columns of text data separated by tabs.

Can be converted to:

FASTA
Convert Formats → Tabular-to-FASTA
The Tabular file must have a title and sequence column.
FASTQ
NGS: QC and manipulation → Generic FASTQ manipulation → Tabular to FASTQ
Interval
If the Tabular file has a chromosome column (or is all on one chromosome) and has a position column, you can create an Interval file (e.g. for SNPs). If it is all on one chromosome, use Text Manipulation → Add column to add a CHROM column. If the given position is 1-based, use Text Manipulation → Compute with the position column minus 1 to get the START, and use the original given column for the END. If the given position is 0-based, use it as the START, and compute that plus 1 to get the END.
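
The START and END arithmetic described above for single-base positions such as SNPs can be sketched as follows. The function name is hypothetical, and the code only mirrors the two Compute steps; it does not call any Galaxy tool.

```python
def position_to_interval(position, one_based):
    """Turn a single-base position into a 0-based, half-open (START, END)."""
    if one_based:
        # 1-based position: START is position - 1, END is the position itself.
        return position - 1, position
    # 0-based position: START is the position, END is position + 1.
    return position, position + 1


# The same base, given in each numbering system, yields the same interval.
print(position_to_interval(100, one_based=True))   # (99, 100)
print(position_to_interval(99, one_based=False))   # (99, 100)
```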

Txtseq.zip

A zipped archive consisting of flat text sequence files. All files in this archive must have the same file extension of '.txt'. You must manually select this file format when uploading the file.

VCF

Variant Call Format (VCF) is a tab-delimited text format with specified fields. It was developed by the 1000 Genomes Project. Field specifications.

Can be converted to:

pgSnp
Convert Formats → VCF to pgSnp

Wiggle custom track

Wiggle tracks are typically used to display per-nucleotide scores in a genome browser. The Wiggle format for custom tracks is line-oriented, and the wiggle data is preceded by a track definition line that specifies which of three different types is being used. More information

Can be converted to:

Interval
Get Genomic Scores → Wiggle-to-Interval
As a second step this could be converted to 3- or 4-column BED, by removing extra columns using Text Manipulation → Cut columns from a table.

gd_ped

Similar to the linkage pedigree format (lped).

Other text type

Any text file.

Can be converted to:

Tabular
If the text has fields separated by spaces, commas, or some other delimiter, it can be converted to Tabular by using Text Manipulation → Convert delimiters to TAB.
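
For readers converting files themselves before uploading, here is a minimal sketch of the same idea outside Galaxy. It treats any run of spaces or commas as one delimiter, which is an assumption of this example and may differ from how the Galaxy tool handles repeated delimiters or empty fields. The function name is hypothetical.

```python
import re


def delimiters_to_tab(line, pattern=r"[ ,]+"):
    """Replace each run of delimiters matched by pattern with a single tab."""
    # Runs are collapsed, so empty fields (e.g. 'a,,b') are not preserved.
    return "\t".join(re.split(pattern, line.strip()))


print(repr(delimiters_to_tab("chr1,10,100")))    # 'chr1\t10\t100'
print(repr(delimiters_to_tab("chr1   10 100")))  # 'chr1\t10\t100'
```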