Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 Jun 1;17(1):118.
doi: 10.1186/s13059-016-0973-5.

Vcfanno: fast, flexible annotation of genetic variants

Affiliations

Vcfanno: fast, flexible annotation of genetic variants

Brent S Pedersen et al. Genome Biol. .

Abstract

The integration of genome annotations is critical to the identification of genetic variants that are relevant to studies of disease or other traits. However, comprehensive variant annotation with diverse file formats is difficult with existing methods. Here we describe vcfanno, which flexibly extracts and summarizes attributes from multiple annotation files and integrates the annotations within the INFO column of the original VCF file. By leveraging a parallel "chromosome sweeping" algorithm, we demonstrate substantial performance gains by annotating ~85,000 variants per second with 50 attributes from 17 commonly used genome annotation resources. Vcfanno is available at https://github.com/brentp/vcfanno under the MIT license.

Keywords: Annotation; Genetic variation; Genome analysis; SNP; VCF; Variant; Variant prioritization.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
Overview of the vcfanno workflow. An unannotated VCF (a) is sent to vcfanno (b) along with a configuration file that indicates the paths to the annotation files, the attributes to extract from each file, and the methods that should be used to describe or summarize the values extracted from those files. The new annotations in the resulting VCF (c) are shown in blue text with additional fields added to the INFO column
Fig. 2
Fig. 2
Overview of the chrom-sweep interval intersection algorithm. The chrom-sweep algorithm sweeps from left to right as it progresses along each chromosome. Green intervals from the query VCF in the first row are annotated by annotation files A (blue) and B (orange) in the second and third rows, respectively. The cache row indicates which intervals are currently in the cache at each point in the progression of the sweeping algorithm. Intervals enter the cache in order of their chromosomal start position. First A1 enters the cache followed by Q1. Since Q1 intersects A1, they are associated, as are Q1 and B1 when B1 enters the cache. Each time a new query interval enters the cache, any interval it does not intersect is ejected. Therefore, when Q2 enters the cache, Q1 and A1 are ejected. Since Q1 is a query interval, it is sent to be reported as output. Proceeding to the right, A2 and then Q3 enter the cache; the latter is a query interval and so the intervals that do not overlap it—B1, Q2, and A2—are ejected from the cache with the query interval, Q2, which is sent to the caller. Finally, as we reach the end of the incoming intervals, we clear out the final Q3 interval and finalize the output for this chromosome. EOF: End of File
Fig. 3
Fig. 3
Parallel sweeping algorithm. As in Fig. 2, we sweep across the chromosome from lower to higher positions (and left to right in the figure). The green query intervals are to be annotated with the two annotation files depicted with blue and orange intervals. The parallelization occurs in chunks of query intervals delineated by the black vertical lines. One process reads query intervals into memory until a maximum gap size to the next interval is reached (e.g., chunks 2, 4) or the number of intervals exceeds the chunk size threshold (e.g., chunks 1, 3). While a new set of query intervals accumulates, the first chunk, bounded to the right by the first vertical black line above, is sent for sweeping and a placeholder is put into a FIFO (first-in, first-out) queue, so that the output remains sorted even though other chunks may finish first. The annotation files are queried with regions based on the bounds of intervals in the query chunk. The queries then return streams of intervals and, finally, those streams are sent to the chrom-sweep algorithm in a new process. When it finishes, its placeholder can be pulled from the FIFO queue and the results are yielded for output
Fig. 4
Fig. 4
Parallelization efficiency. We show the efficiency of the parallelization strategy relative to one process on a whole-genome (1000 Genomes (1000G) in blue) and whole-exome (ExAC in green) dataset. In both cases, we are short of the ideal speedup (gray line) but we observe an approximately sevenfold speedup using 16 processors. Absolute times are provided in Additional file 3: Table S1
Fig. 5
Fig. 5
Effect of gap and chunk size on runtime for different data distributions. a Runtimes for annotating small variants on chromosome 20 for 1000 Genomes (1KG; 1,822,268 variants) and ExAC (256,057 variants) against 17 annotation files using four cores and different combinations of gap size and chunk size. b Data density for chromosome 20 of 1000 Genomes, ExAC, and the summation of the 17 annotation files
Fig. 6
Fig. 6
Using a “post-annotation” block to compute new annotations derived from existing annotations. a As described in the text, computing this post-annotation assumes that the “exac_total” and “exac_alts” fields have been extracted (via the AN and AC fields in the ExAC VCF) from the annotation file using standard annotation blocks. The AC field returns an array of alternative alleles for the variants in the annotation file that match a given variant in the query VCF. b In this post-annotation block, we calculate the alternative allele frequency in ExAC (“exac_aaf”) using “exac_alts [1]” as the numerator because in this example we calculate the alternative allele frequency based on the first alternative allele. c An example of a post-annotation block that calls a function (af_conf_int) in an external lua script (Additional file 2) to compute the lower bound of the allele frequency confidence interval based upon counts of the alternative allele and total observed alleles. d An example invocation of vcfanno that annotates a BGZIP-compressed VCF file (example.vcf.gz) using the configuration file described in panels (ac) (red), together with the lua script file (blue) containing the code underlying the af_conf_int function

Similar articles

Cited by

References

    1. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–8. - PMC - PubMed
    1. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303. - PMC - PubMed
    1. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv [q-bio.GN]. 2012.
    1. Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics. 2014;30:2843–51. doi: 10.1093/bioinformatics/btu356. - DOI - PMC - PubMed
    1. Koren A, Polak P, Nemesh J, Michaelson JJ, Sebat J, Sunyaev SR, McCarroll SA. Differential relationship of DNA replication timing to different forms of human mutation and variation. Am J Hum Genet. 2012;91:1033–40. - PMC - PubMed

Publication types