Review
Bioinformatics. 2014 Oct 15;30(20):2843-51.
doi: 10.1093/bioinformatics/btu356. Epub 2014 Jun 27.

Toward Better Understanding of Artifacts in Variant Calling From High-Coverage Samples

Free PMC article
Heng Li. Bioinformatics.

Abstract

Motivation: Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts remain unclear, despite great efforts in evaluating variant calling methods.

Results: We made 10 single nucleotide polymorphism and INDEL call sets with two read mappers and five variant callers, on both a haploid human genome and a diploid genome at a similar coverage. By investigating false heterozygous calls in the haploid genome, we identified erroneous realignment in low-complexity regions and the incompleteness of the reference genome with respect to the sample as the two major sources of errors, which press for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10-15 kb, but the error rate of post-filtered calls is reduced to 1 in 100-200 kb without significantly compromising sensitivity.

Availability and implementation: BWA-MEM alignment and raw variant calls are available at http://bit.ly/1g8XqRt; scripts and miscellaneous data are available at https://github.com/lh3/varcmp.

Contact: hengli@broadinstitute.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Effect of filters. LC filter: not overlapping LCRs identified by the DUST algorithm. MD filter: read depth below d + 3√d, where d is the average read depth. Miscellaneous filter (misc) includes three filters: allele balance above 30%, variant supported by non-reference reads on both strands, and Fisher strand P-value >0.01. Filters are applied in the order of LC, MD and misc, with MD applied to variants passing LC, and misc applied to variants passing both LC and MD. For each call set, the total height of the bar gives the number of raw variant calls with the reported quality in VCF no less than 30. Note that the Y-axes are scaled differently
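The filter cascade described in this caption can be sketched in Python as follows. This is a minimal illustration, not the author's scripts (those are in the varcmp repository): the `Call` record, its field names and the function names are hypothetical, and the sketch reads the MD threshold as d + 3√d.

```python
import math
from dataclasses import dataclass


@dataclass
class Call:
    # Hypothetical per-variant record; field names are illustrative only.
    in_lcr: bool            # overlaps a low-complexity region (e.g. from DUST)
    depth: int              # read depth at the site
    allele_balance: float   # fraction of reads supporting the non-reference allele
    alt_fwd: int            # non-reference reads on the forward strand
    alt_rev: int            # non-reference reads on the reverse strand
    fisher_strand_p: float  # Fisher strand P-value


def max_depth_limit(mean_depth, k=3.0):
    """Maximum-depth cutoff d + k*sqrt(d), with d the average read depth."""
    return mean_depth + k * math.sqrt(mean_depth)


def filter_cascade(calls, mean_depth):
    """Count calls surviving LC, then MD, then misc, applied in that order."""
    limit = max_depth_limit(mean_depth)
    counts = {"raw": 0, "LC": 0, "MD": 0, "misc": 0}
    for c in calls:
        counts["raw"] += 1
        if c.in_lcr:                 # LC filter: drop calls inside LCRs
            continue
        counts["LC"] += 1
        if c.depth >= limit:         # MD filter: keep depth below the cutoff
            continue
        counts["MD"] += 1
        passes_misc = (
            c.allele_balance > 0.30
            and c.alt_fwd > 0 and c.alt_rev > 0
            and c.fisher_strand_p > 0.01
        )
        if passes_misc:
            counts["misc"] += 1
    return counts
```

Because each stage only sees calls that passed the previous one, the counts are nested (raw ≥ LC ≥ MD ≥ misc), which matches the stacked presentation the caption describes.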
Fig. 2.
Relationship between CHM1 heterozygous call sets. Raw variant calls were filtered with variant quality no less than 30, allele balance >20%, Fisher strand P-value >0.001 and maximum read depth below d + 4√d, where d is the average read depth. (A) Relationship between heterozygous SNP call sets. Two SNPs are considered the same if they are at the same position. (B) Relationship between heterozygous INDEL call sets. Two filtered INDELs are said to be linked if the 3′ end of one INDEL is within 20 bp of the 5′ end of the other INDEL, or vice versa. An INDEL cluster is a connected component (not a clique) of linked INDELs. It is possible that in a cluster two INDELs are distant from each other but both overlap a third INDEL. The Venn diagram shows the number of INDEL clusters falling in each category based on the sources of INDELs in each cluster. In total, 15% of SNPs and 91% of INDELs in the 3-way intersections overlap LCRs
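The INDEL clustering in panel (B) can be illustrated with a short Python sketch. It is a hedged reading of the caption rather than the author's implementation: INDELs are hypothetical `(chrom, start, end, source)` tuples, and two INDELs are treated as linked when they overlap or when the gap between them is at most 20 bp. Under that assumption, connected components can be found with a single sorted sweep.

```python
from collections import Counter


def cluster_indels(indels, max_gap=20):
    """Group INDELs into connected components of linked INDELs.

    indels: iterable of (chrom, start, end, source) tuples (hypothetical layout).
    Assumption: two INDELs are linked if they overlap or lie within max_gap bp.
    """
    clusters = []
    current = []
    cur_chrom, cur_end = None, None
    for chrom, start, end, source in sorted(indels):
        if current and chrom == cur_chrom and start - cur_end <= max_gap:
            # Linked to at least one member of the open cluster.
            current.append((chrom, start, end, source))
            cur_end = max(cur_end, end)
        else:
            if current:
                clusters.append(current)
            current = [(chrom, start, end, source)]
            cur_chrom, cur_end = chrom, end
    if current:
        clusters.append(current)
    return clusters


def venn_counts(clusters):
    """Count clusters by the set of sources (e.g. call sets) contributing to them."""
    return Counter(
        frozenset(source for _, _, _, source in cluster) for cluster in clusters
    )
```

Tracking only the furthest end seen so far is enough here: a new INDEL joins the open cluster whenever it is within reach of any earlier member, so two members may be far apart and still share a cluster through an intermediate INDEL, which is the "connected component, not a clique" behaviour the caption points out.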
Fig. 3.
Relationship between NA12878 heterozygous call sets
Fig. 4.
Example of misalignment around chr1:26608841 in CHM1. The truth allele is derived from local assembly. Three erroneous read alignments and their correct alignments are shown below it. Each of the three reads is an exact substring of the truth allele, but their alignments are different. The first read ‘errRead1’ is aligned without gaps, as the 3′ end of the read is a substring of the 18 bp deletion. Read ‘errRead2’ is aligned with a 6 bp insertion, as this alignment is better than having two long deletions. Read ‘errRead3’ is also aligned without gaps but with seven mismatches. It is possible for an aligner to find its correct alignment given a small gap extension penalty. In this example, Bowtie2 did not align any reads with gaps. BWA-MEM aligned four reads correctly. Except for HaplotypeCaller, which locally assembled reads, all other callers called multiple heterozygotes around this region
Fig. 5.
Effect of filters after removing variants in LCRs. Each filter is associated with one value. For each filter, the numbers of heterozygous SNPs called from CHM1 and NA12878 are counted cumulatively from the most stringent threshold on the filter value to the most relaxed threshold. Thresholds are chosen such that they approximately evenly divide variants into 100 bins. Each chosen threshold yields a point in the plot. An arrow points to a point on the MD curve when the corresponding read depth is right above d + 4√d, where d is the mean read depth across called variants
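One way to build such a curve is sketched below in Python. The function name, its arguments and the choice to pick thresholds from the pooled CHM1 and NA12878 values are assumptions made for illustration; the caption does not say how the bins were computed.

```python
from bisect import bisect_right


def cumulative_curve(chm1_values, na12878_values, n_bins=100,
                     higher_is_stricter=True):
    """Return (threshold, n_chm1, n_na12878) points, strictest threshold first.

    Each input is a list of per-variant filter values (e.g. allele balance).
    Assumption: thresholds are quantiles of the pooled values, so that they
    split the variants into roughly n_bins equal-sized bins.
    """
    # Map values to keys where a smaller key means a more stringent threshold.
    sign = -1.0 if higher_is_stricter else 1.0
    a = sorted(sign * v for v in chm1_values)
    b = sorted(sign * v for v in na12878_values)
    pooled = sorted(a + b)
    n = len(pooled)
    if n == 0:
        return []
    points = []
    for i in range(1, n_bins + 1):
        idx = max(1, (i * n) // n_bins) - 1   # last variant in the i-th bin
        key = pooled[idx]
        # Cumulative counts: variants at least as stringent as this threshold.
        points.append((sign * key, bisect_right(a, key), bisect_right(b, key)))
    return points
```

For a filter where smaller values are stricter, such as read depth under a maximum-depth cutoff, `higher_is_stricter=False` would be the appropriate setting. Since CHM1 is haploid, its heterozygous calls are errors, so plotting the CHM1 count against the NA12878 count at each threshold gives a rough error-versus-yield trade-off for that filter.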
Fig. 6.
Relationship of CHM1 heterozygous SNPs called from mappings to different reference genomes. CHM1 reads were mapped with BWA-MEM. Autosomal SNPs were called with GATK HaplotypeCaller and passed the LC filter. Heterozygous calls from GRCh38 were lifted to GRCh37 with the liftOver tool from UCSC under the default setting
