bcftools query

Extracts fields from VCF/BCF file and prints them in user-defined format (Galaxy Version 1.15.1+galaxy4)

Tool Parameters

Query Options
Example: %CHROM\t%POS\t%REF\t%ALT{0}\n (note: '\t' is a TAB character and '\n' is a newline character)
Fields in your Format are separated by a TAB character: \t
Print "." for undefined tags


Help

Format:

``%CHROM``          The CHROM column (similarly also other columns: POS, ID, REF, ALT, QUAL, FILTER)
``%INFO/TAG``       Any tag in the INFO column
``%TYPE``           Variant type (REF, SNP, MNP, INDEL, OTHER)
``%MASK``           Indicates presence of the site in other files (with multiple files)
``%TAG{INT}``       Curly brackets to subscript vectors (0-based)
``%FIRST_ALT``      Alias for %ALT{0}
``[]``              The brackets loop over all samples
``%GT``             Genotype (e.g. 0/1)
``%TBCSQ``          Translated FORMAT/BCSQ. See the bcftools csq documentation for explanation and examples.
``%TGT``            Translated genotype (e.g. C/A)
``%IUPACGT``        Genotype translated to IUPAC ambiguity codes (e.g. M instead of C/A)
``%LINE``           Prints the whole line
``%SAMPLE``         Sample name
``%POS0``           POS in 0-based coordinates
``%END``            End position of the REF allele
``%END0``           End position of the REF allele in 0-based coordinates
``\n``              new line
``\t``              tab character
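
To make the format language above more concrete, here is a minimal Python sketch that expands a small subset of it (column tags, `%TAG{INT}` subscripts, and the `\t` / `\n` escapes) against a single record. The function name `expand_format` and the dict-based record are hypothetical, chosen only for illustration; this is not how bcftools itself is implemented, and it always prints "." for unknown tags, as the "Print "." for undefined tags" option would.

```python
import re

def expand_format(fmt, record):
    """Expand a small subset of the query format against one record (a dict of strings)."""
    # Turn the two-character escapes \t and \n into real characters.
    fmt = fmt.replace("\\t", "\t").replace("\\n", "\n")

    def substitute(match):
        tag, index = match.group(1), match.group(2)
        value = record.get(tag, ".")      # "." stands in for undefined tags
        if index is not None:             # %TAG{INT}: 0-based subscript into a vector
            items = value.split(",")
            i = int(index)
            return items[i] if i < len(items) else "."
        return value

    return re.sub(r"%([A-Z_0-9]+)(?:\{(\d+)\})?", substitute, fmt)

# Hypothetical record with two alternate alleles.
record = {"CHROM": "1", "POS": "1000", "REF": "A", "ALT": "C,T"}
line = expand_format(r"%CHROM\t%POS\t%REF\t%ALT{0}\n", record)
print(repr(line))  # '1\t1000\tA\tC\n'
```

Sample loops (`[]`) and translated genotypes are deliberately left out to keep the sketch short.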

Examples:

# Print chromosome, position, ref allele and the first alternate allele
bcftools query -f '%CHROM  %POS  %REF  %ALT{0}\n' file.vcf.gz

# Similar to above, but use tabs instead of spaces, add sample name and genotype
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz

# Print FORMAT/GQ fields followed by FORMAT/GT fields
bcftools query -f 'GQ:[ %GQ] \t GT:[ %GT]\n' file.vcf

# Make a BED file: chr, pos (0-based), end pos (1-based), id
bcftools query -f'%CHROM\t%POS0\t%END\t%ID\n' file.bcf
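
The BED example works because a 0-based start together with a 1-based inclusive end is exactly a BED half-open interval. The Python sketch below shows the arithmetic, assuming the end position is derived from the length of the REF allele (no INFO/END tag is considered); `bed_fields` is a hypothetical helper name.

```python
def bed_fields(chrom, pos, ref, variant_id="."):
    """Return a BED-style line for a record with 1-based POS and the given REF allele."""
    pos0 = pos - 1               # %POS0: POS shifted to 0-based coordinates
    end = pos + len(ref) - 1     # %END: last base covered by REF (1-based, inclusive)
    return f"{chrom}\t{pos0}\t{end}\t{variant_id}"

# A 3-base REF allele at position 1000 covers bases 1000..1002,
# which is the BED interval [999, 1002).
print(bed_fields("1", 1000, "ACG", "rs1"))
```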

Collapse

Controls how to treat records with duplicate positions and defines compatible records across multiple input files. Here by "compatible" we mean records which should be considered as identical by the tools. For example, when performing line intersections, the desire may be to consider as identical all sites with matching positions (bcftools isec -c all), or only sites with matching variant type (bcftools isec -c snps -c indels), or only sites with all alleles identical (bcftools isec -c none).

Flag value    Result
none          only records with identical REF and ALT alleles are compatible
some          only records where some subset of ALT alleles match are compatible
all           all records are compatible, regardless of whether the ALT alleles match or not. In the case of records with the same position, only the first will be considered and appear on output.
snps          any SNP records are compatible, regardless of whether the ALT alleles match or not. For duplicate positions, only the first SNP record will be considered and appear on output.
indels        all indel records are compatible, regardless of whether the REF and ALT alleles match or not. For duplicate positions, only the first indel record will be considered and appear on output.
both          abbreviation of "-c indels -c snps"
id            only records with identical ID column are compatible. Supported by bcftools merge only.
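
As a rough illustration of the flag values above, the following Python sketch decides whether two records at the same CHROM and POS would count as compatible. It is a simplified model, not bcftools' code: the variant classifier is deliberately crude, the `id` mode is omitted, and requiring an identical REF for "some" is an assumption. The names `variant_type` and `compatible` are hypothetical.

```python
def variant_type(ref, alts):
    """Crude classifier: 'snp' if every allele is a single base, 'indel' if lengths differ."""
    if all(len(allele) == 1 for allele in [ref] + alts):
        return "snp"
    if any(len(alt) != len(ref) for alt in alts):
        return "indel"
    return "other"

def compatible(a, b, mode):
    """a and b are (ref, [alts]) tuples located at the same CHROM and POS."""
    (ref_a, alts_a), (ref_b, alts_b) = a, b
    if mode == "none":
        return ref_a == ref_b and alts_a == alts_b
    if mode == "some":
        # Assumption: REF must match and at least one ALT allele is shared.
        return ref_a == ref_b and bool(set(alts_a) & set(alts_b))
    if mode == "all":
        return True
    type_a, type_b = variant_type(ref_a, alts_a), variant_type(ref_b, alts_b)
    if mode == "snps":
        return type_a == type_b == "snp"
    if mode == "indels":
        return type_a == type_b == "indel"
    if mode == "both":
        return type_a == type_b and type_a in ("snp", "indel")
    raise ValueError(f"unsupported mode: {mode}")

snp_c = ("A", ["C"])
snp_t = ("A", ["T"])
print(compatible(snp_c, snp_t, "none"))  # False: ALT alleles differ
print(compatible(snp_c, snp_t, "snps"))  # True: both are SNPs
```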

Region Selections

Regions can be specified in a VCF, BED, or tab-delimited file (the default). The columns of the tab-delimited file are CHROM, POS and, optionally, POS_TO, where positions are 1-based and inclusive. Uncompressed files are stored in memory, while bgzip-compressed and tabix-indexed region files are streamed. Sequence names must match exactly: "chr20" is not the same as "20". Chromosome ordering in FILE is respected: the VCF is processed in the order in which chromosomes first appear in FILE. Within chromosomes, however, the VCF is always processed in ascending genomic coordinate order, no matter in what order the regions appear in FILE. Overlapping regions in FILE can result in duplicated, out-of-order positions in the output. This option requires indexed VCF/BCF files.
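
The tab-delimited layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `parse_regions` and `in_regions` are hypothetical helpers, and a line without POS_TO is treated as a single-position region.

```python
def parse_regions(text):
    """Parse tab-delimited lines of CHROM, POS and optional POS_TO (1-based, inclusive)."""
    regions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        chrom, start = fields[0], int(fields[1])
        # Assumption: without POS_TO the region is the single position POS.
        end = int(fields[2]) if len(fields) > 2 else start
        regions.append((chrom, start, end))
    return regions

def in_regions(chrom, pos, regions):
    """True if the position falls inside any region; sequence names must match exactly."""
    return any(c == chrom and s <= pos <= e for c, s, e in regions)

regions = parse_regions("20\t100\t200\n20\t500\n")
print(in_regions("20", 200, regions))     # True: the end coordinate is inclusive
print(in_regions("chr20", 150, regions))  # False: "chr20" is not the same as "20"
```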

Targets

Similar to regions, but the next position is accessed by streaming the whole VCF/BCF rather than using the tbi/csi index. Both regions and targets options can be applied simultaneously: regions uses the index to jump to a region and targets discards positions which are not in the targets. Unlike regions, targets can be prefixed with "^" to request logical complement. For example, "^X,Y,MT" indicates that sequences X, Y and MT should be skipped. Yet another difference between the two is that regions checks both start and end positions of indels, whereas targets checks start positions only.
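
The difference in how indels are matched, and the "^" complement, can be modelled with the short Python sketch below. It assumes that "checks both start and end positions" amounts to an overlap test on the span of the REF allele; that reading, and all function names, are assumptions made for illustration rather than a description of bcftools internals.

```python
def matches_region(pos, ref, start, end):
    """Regions-style check: the REF span [pos, pos + len(ref) - 1] overlaps [start, end]."""
    rec_end = pos + len(ref) - 1
    return pos <= end and rec_end >= start

def matches_target(pos, start, end):
    """Targets-style check: only the start position of the record is tested."""
    return start <= pos <= end

def keep_sequence(chrom, targets):
    """Handle a comma-separated sequence list with an optional '^' complement prefix."""
    if targets.startswith("^"):
        return chrom not in targets[1:].split(",")
    return chrom in targets.split(",")

# A deletion at POS 98 with REF "ACGT" covers positions 98..101.
print(matches_region(98, "ACGT", 100, 200))  # True: the deletion reaches into the region
print(matches_target(98, 100, 200))          # False: the start lies before the interval
print(keep_sequence("X", "^X,Y,MT"))         # False: X is excluded by the complement
```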

For the bcftools call command with the option -C alleles, the third column of the targets file must be a comma-separated list of alleles, starting with the reference allele. Note that the file must be compressed and indexed. Such a file can be easily created from a VCF using:

bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' file.vcf | bgzip -c > als.tsv.gz && tabix -s1 -b2 -e2 als.tsv.gz

Expressions

Valid expressions may contain:

  • numerical constants, string constants

    1, 1.0, 1e-4
    "String"
    
  • arithmetic operators

    +,*,-,/
    
  • comparison operators

    == (same as =), >, >=, <=, <, !=
    
  • regex operators "~" and its negation "!~"

    INFO/HAYSTACK ~ "needle"
    
  • parentheses

    (, )
    
  • logical operators

    && (same as &), ||,  |
    
  • INFO tags, FORMAT tags, column names

    INFO/DP or DP
    FORMAT/DV, FMT/DV, or DV
    FILTER, QUAL, ID, REF, ALT[0]
    
  • 1 (or 0) to test the presence (or absence) of a flag

    FlagA=1 && FlagB=0
    
  • "." to test missing values

    DP=".", DP!=".", ALT="."
    
  • missing genotypes can be matched regardless of phase and ploidy (".|.", "./.", ".") using this expression

    GT="."
    
  • TYPE for variant type in REF,ALT columns (indel,snp,mnp,ref,other)

    TYPE="indel" | TYPE="snp"
    
  • array subscripts, "*" for any field

    (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3
    DP4[*] == 0
    CSQ[*] ~ "missense_variant.*deleterious"
    
  • function on FORMAT tags (over samples) and INFO tags (over vector fields)

    MAX, MIN, AVG, SUM, STRLEN, ABS
    
  • variables calculated on the fly if not present: number of alternate alleles; number of samples; count of alternate alleles; minor allele count (similar to AC, but counts the less frequent allele, so the corresponding frequency never exceeds 0.5); frequency of alternate alleles (AF=AC/AN); frequency of minor alleles (MAF=MAC/AN); number of alleles in called genotypes

    N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN
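
The on-the-fly variables listed last can be illustrated with a small Python sketch that derives them from genotype strings. Treating MAC as min(AC, AN-AC) per alternate allele is an assumption made here for illustration, and `allele_stats` is a hypothetical helper, not a bcftools function.

```python
import re

def allele_stats(genotypes, n_alt=1):
    """Compute AN, AC, MAC, AF and MAF from GT strings such as '0/1' or '1|1'."""
    alleles = []
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele != ".":             # missing alleles do not count towards AN
                alleles.append(int(allele))
    an = len(alleles)                     # number of alleles in called genotypes
    ac = [alleles.count(i) for i in range(1, n_alt + 1)]
    mac = [min(c, an - c) for c in ac]    # assumption: minor count is min(AC, AN - AC)
    af = [c / an for c in ac] if an else []
    maf = [m / an for m in mac] if an else []
    return {"N_SAMPLES": len(genotypes), "AN": an, "AC": ac,
            "MAC": mac, "AF": af, "MAF": maf}

stats = allele_stats(["0/1", "1/1", "./.", "0|1"])
print(stats["AN"], stats["AC"], stats["MAC"])  # 6 [4] [2]
```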
    

Notes:

  • String comparisons and regular expressions are case-insensitive
  • If the subscript "*" is used in regular expression search, the whole field is treated as one string. For example, the regex STR[*]~"B,C" will be true for the string vector INFO/STR=AB,CD.
  • Variables and function names are case-insensitive, but not tag names. For example, "qual" can be used instead of "QUAL", "strlen()" instead of "STRLEN()" , but not "dp" instead of "DP".
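
The second note is easy to misread, so here is a minimal Python sketch of it: with the "*" subscript, the vector is searched as one comma-joined string, which is why a pattern can match across the boundary between two elements. The helper names are hypothetical, and Python's `re` module stands in for whatever regex engine bcftools uses.

```python
import re

def any_field_regex(values, pattern):
    """Model of TAG[*] ~ "pattern": the whole field is searched as one string."""
    joined = ",".join(values)
    # String comparisons and regular expressions are described as case-insensitive.
    return re.search(pattern, joined, flags=re.IGNORECASE) is not None

def per_element_regex(values, pattern):
    """For contrast: search each vector element separately."""
    return any(re.search(pattern, v, flags=re.IGNORECASE) for v in values)

values = ["AB", "CD"]                     # INFO/STR=AB,CD
print(any_field_regex(values, "B,C"))     # True: "B,C" spans the element boundary
print(per_element_regex(values, "B,C"))   # False: no single element contains "B,C"
```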

Examples:

MIN(DV)>5
MIN(DV/DP)>0.3
MIN(DP)>10 & MIN(DV)>3
FMT/DP>10  & FMT/GQ>10 .. both conditions must be satisfied within one sample
FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples
QUAL>10 |  FMT/GQ>10   .. selects only GQ>10 samples
QUAL>10 || FMT/GQ>10   .. selects all samples at QUAL>10 sites
TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
MIN(DP)>35 && AVG(GQ)>50
ID=@file       .. selects lines with ID present in the file
ID!=@~/file    .. skip lines with ID present in the ~/file
MAF[0]<0.05    .. select rare variants at 5% cutoff
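
The difference between "&" and "&&" in the examples above can be modelled in Python as follows. This is a sketch of the described semantics only; the sample dictionaries and function names are hypothetical.

```python
def within_one_sample(samples):
    """FMT/DP>10 & FMT/GQ>10: both conditions must hold in the same sample."""
    return any(s["DP"] > 10 and s["GQ"] > 10 for s in samples)

def across_samples(samples):
    """FMT/DP>10 && FMT/GQ>10: each condition may be met by a different sample."""
    return any(s["DP"] > 10 for s in samples) and any(s["GQ"] > 10 for s in samples)

# One sample has high depth, the other has high genotype quality.
samples = [{"DP": 20, "GQ": 5}, {"DP": 3, "GQ": 40}]
print(within_one_sample(samples))  # False: no single sample passes both tests
print(across_samples(samples))     # True: the conditions are met in different samples
```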

http://samtools.github.io/bcftools/bcftools.html#query

https://github.com/samtools/bcftools/wiki
