This document is a live copy of supplementary materials for Galaxy's MAF (Multiple Alignment Format) manipulation tools. The latest version can be found at http://usegalaxy.org/u/dan/p/maf.
In addition to the text provided here, the UCSC Genome Bioinformatics group maintains a description of the MAF format.
The multiple alignment format (MAF) has emerged as a de facto standard for storing and exchanging whole genome multiple alignments. Alignments stored in this format retain the sequence and genomic position information for the aligning sequence ranges. As a convention in Galaxy, sequences are named according to the source species genome build and the sequence identifier within that build (generally a chromosome, contig or scaffold); the genome build and sequence identifier are separated by a period. For example (Figure S1), the sequence of chromosome 21 from the March 2006 human genome assembly (known here as hg18) would be named “hg18.chr21”. Alignments are arranged in “blocks” separated by a blank line, where each block constitutes an individual set of sequence ranges (e.g. a single local alignment involving some set of species). These ranges need not be unique, as a MAF set can contain overlapping blocks. In the MAF format, alignments to the “-” strand are numbered relative to the reverse complement of the source sequence (unlike common formats for genome annotation such as GFF and BED). Though often a roadblock to biologists trying to work with these files, this important difference in coordinate systems is resolved internally within this toolset and requires no effort or consideration on the part of users.
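The naming convention and the strand-relative coordinates described above can be illustrated with a minimal sketch. It is not the code used by the toolset: the names `MafComponent`, `parse_s_line` and `forward_interval` are hypothetical, and the sketch assumes the standard seven-field MAF "s" line (source, start, size, strand, source size, aligned text) with zero-based starts.

```python
from typing import NamedTuple, Tuple


class MafComponent(NamedTuple):
    build: str      # genome build, e.g. "hg18"
    seq_id: str     # sequence identifier within the build, e.g. "chr21"
    start: int      # zero-based start, relative to the listed strand
    size: int       # number of non-gap bases in this component
    strand: str     # "+" or "-"
    src_size: int   # total length of the source sequence
    text: str       # aligned text, including "-" gap characters


def parse_s_line(line: str) -> MafComponent:
    """Parse one MAF 's' line into its fields (hypothetical helper)."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError("not a MAF 's' line")
    _, src, start, size, strand, src_size, text = fields
    # Galaxy convention: build and sequence identifier are separated by a period.
    build, _, seq_id = src.partition(".")
    return MafComponent(build, seq_id, int(start), int(size),
                        strand, int(src_size), text)


def forward_interval(c: MafComponent) -> Tuple[int, int]:
    """Return the half-open [start, end) interval on the forward strand."""
    if c.strand == "+":
        return c.start, c.start + c.size
    # "-" strand coordinates count from the start of the reverse complement.
    return c.src_size - (c.start + c.size), c.src_size - c.start


component = parse_s_line("s hg18.chr21 100 10 - 1000 ACGT-ACGTAC-")
print(component.build, component.seq_id, forward_interval(component))
```

The interval returned by `forward_interval` is in the same forward-strand, zero-based, half-open system that BED uses, which is the kind of conversion the toolset performs internally on behalf of the user.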
For the list of alignments currently available in the main public Galaxy instance see http://bitbucket.org/galaxy/galaxy-central/wiki/AvailableData.
For instructions on adding pre-cached alignments to a Galaxy instance see http://bitbucket.org/galaxy/galaxy-central/wiki/Config/ToolData/AddMAFs.
Several compression algorithms were considered, and two (bzip2 and LZO) were implemented within the bx-python library. In the end, compression based on the LZO algorithm was determined to offer a desirable balance between CPU intensity and compression level.
After testing various analyses with the 28-way alignments, it was found that the limiting factor for most analyses was bzip2 decompression. This held despite the indexed semi-random access methods described in the primary manuscript and aggressive caching of uncompressed data. While the bzip2 algorithm provides excellent compression, it does so at the expense of being very CPU intensive for both compression and decompression. This makes it a great algorithm for archival storage, but a serious drawback when compressing "on-line" data.
However, the LZO family of algorithms has extremely fast decompression rates: so fast that, with slow disks, reading LZO compressed data can be faster than reading uncompressed data. The compression ratio is still reasonable, resulting in files only slightly larger than those produced by gzip. There is an open source front-end program, lzop (http://www.lzop.org/), which is included with several software distributions. The file format it uses is based on compressing blocks of fixed sizes in the uncompressed stream, making semi-random access with caching much more straightforward than with bzip2.
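The idea of compressing fixed-size blocks of the uncompressed stream, and why it makes semi-random access with caching straightforward, can be sketched as follows. This is only an illustration under stated assumptions: it uses zlib from the Python standard library as a stand-in for LZO, it does not reproduce the lzop file format or the bx-python implementation, and the names `compress_blocks`, `read_range` and `BLOCK_SIZE` are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # hypothetical size of each uncompressed block


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress fixed-size slices of data independently.

    Returns the compressed payload and an index of (offset, length) pairs,
    one per block, locating each compressed block within the payload.
    """
    payload = bytearray()
    index = []
    for pos in range(0, len(data), block_size):
        chunk = zlib.compress(data[pos:pos + block_size])
        index.append((len(payload), len(chunk)))
        payload.extend(chunk)
    return bytes(payload), index


def read_range(payload, index, start, length, block_size=BLOCK_SIZE, cache=None):
    """Return data[start:start + length], decompressing only the blocks touched."""
    if cache is None:
        cache = {}
    if length <= 0:
        return b""
    # Because blocks have a fixed uncompressed size, the block numbers
    # follow directly from the requested uncompressed offsets.
    first = start // block_size
    last = (start + length - 1) // block_size
    parts = []
    for block_no in range(first, min(last, len(index) - 1) + 1):
        if block_no not in cache:  # reuse blocks decompressed earlier
            offset, size = index[block_no]
            cache[block_no] = zlib.decompress(payload[offset:offset + size])
        parts.append(cache[block_no])
    joined = b"".join(parts)
    skip = start - first * block_size
    return joined[skip:skip + length]


data = bytes(range(256)) * 1000
payload, index = compress_blocks(data, block_size=4096)
cache = {}
piece = read_range(payload, index, 10000, 5000, block_size=4096, cache=cache)
print(piece == data[10000:15000], sorted(cache))
```

Only the blocks that overlap the requested range are decompressed, and the cache lets later reads of neighbouring positions reuse them. A stream compressed as a single unit offers no such shortcut without additional indexing.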
In one example, extracting all coding exons on chr10 from the compressed 28-way alignment took one minute using LZO (level 7) compression, whereas the same extraction using bzip2 took 8.5 minutes.
Source code for this toolset is made available along with the main Galaxy distribution. The command line tools are located under the /tools/maf/ directory. The Galaxy distribution is available from http://getgalaxy.org using the Mercurial version control system (preferred) or by downloading a tarball (http://dist.g2.bx.psu.edu/).
|(A) A single MAF block is shown with each component of the block being described. (B) Three MAF blocks are displayed. There are a total of five species included in this small alignment set, but one species (dog: canFam2) is missing from the second block.|
|In this representation of the Extract MAF blocks given a set of genomic intervals tool, a single genomic interval is found to overlap with three MAF blocks in a source alignment set. MAF blocks 1 and 3 extend beyond the boundaries of the provided genomic interval and are trimmed before being included in the tool output.|
|Here, the two output styles of the MAF to FASTA tool are illustrated, one which creates a one-to-one mapping of MAF blocks to FASTA blocks and another which creates a single concatenated multiple-species FASTA block, where species which are absent from a particular block have their sequence filled in with gap characters.|
|In this illustration of the Stitch MAF blocks given a set of genomic intervals tool, four MAF alignment blocks are stitched into a single FASTA alignment block composed of only those positions that exist in the genome of the provided intervals.|
|The Filter MAF blocks by Species tool allows users to remove undesired species from alignments. When species are removed from an alignment set, alignment columns that now contain only gaps are collapsed (excluded from the output).|
|Like the Filter MAF blocks by Species tool (figure 5), the Join MAF blocks by Species tool allows users to remove undesired species, but takes an additional step to combine genomically adjacent blocks together. When genome 3 is removed from the alignment set, two of the three alignment blocks are joined together, resulting in only two output alignment blocks.|
|The Filter MAF blocks by Size tool removes alignment blocks that fall outside of a specified size range. Here, all blocks that have more than 5 or fewer than 4 alignment columns are removed.|
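Two of the operations illustrated above, removing species with collapse of gap-only columns and filtering blocks by size, can be sketched in a few lines. This is a simplified illustration rather than the tools' actual source: a block is represented here as a plain dictionary mapping a species name to its aligned text, and the function names `filter_species` and `within_size` are hypothetical.

```python
def filter_species(block, keep):
    """Keep only the named species, then drop columns that contain only gaps."""
    kept = {name: text for name, text in block.items() if name in keep}
    if not kept:
        return {}
    # zip(*...) walks the alignment column by column.
    columns = [col for col in zip(*kept.values())
               if any(ch != "-" for ch in col)]
    return {name: "".join(col[i] for col in columns)
            for i, name in enumerate(kept)}


def within_size(block, min_cols, max_cols):
    """True if the block's number of alignment columns is in [min_cols, max_cols]."""
    n_cols = len(next(iter(block.values()), ""))
    return min_cols <= n_cols <= max_cols


block = {
    "hg18":    "AC--GT",
    "panTro2": "AC--GT",
    "canFam2": "ACTTG-",
}
filtered = filter_species(block, {"hg18", "panTro2"})
print(filtered)                     # the two gap-only columns are collapsed
print(within_size(block, 4, 5))     # six columns: outside the 4-5 range
print(within_size(filtered, 4, 5))  # four columns: inside the range
```

Once canFam2 is removed, the two columns in which only the dog sequence had bases contain nothing but gaps, so they are excluded from the output, matching the behaviour described for the Filter MAF blocks by Species tool.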
Examples of the use of this toolset can be found at http://usegalaxy.org/u/dan/p/maf-exercises.