PLoS One, 8 (10), e75402
eCollection

Coval: Improving Alignment Quality and Variant Calling Accuracy for Next-Generation Sequencing Data

Shunichi Kosugi et al. PLoS One.

Erratum in

  • PLoS One. 2014;9(1). doi:10.1371/annotation/cc88d2b5-36e8-441a-ab5f-58a9ed143d6b

Abstract

Accurate identification of DNA polymorphisms using next-generation sequencing technology is challenging because of a high rate of sequencing error and incorrect mapping of reads to reference genomes. Currently available short read aligners and DNA variant callers suffer from these problems. We developed the Coval software to improve the quality of short read alignments. Coval is designed to minimize the incidence of spurious alignment of short reads, by filtering mismatched reads that remained in alignments after local realignment and error correction of mismatched reads. The error correction is executed based on the base quality and allele frequency at the non-reference positions for an individual or pooled sample. We demonstrated the utility of Coval by applying it to simulated genomes and experimentally obtained short-read data of rice, nematode, and mouse. Moreover, we found an unexpectedly large number of incorrectly mapped reads in 'targeted' alignments, where the whole genome sequencing reads had been aligned to a local genomic segment, and showed that Coval effectively eliminated such spurious alignments. We conclude that Coval significantly improves the quality of short-read sequence alignments, thereby increasing the calling accuracy of currently available tools for SNP and indel identification. Coval is available at http://sourceforge.net/projects/coval105/.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Improvement of SNP/indel calling accuracies of various DNA variant callers by Coval-Refine.
(A) SNP calling accuracy with or without Coval-Refine. (B) Indel calling accuracy with or without Coval-Refine. Experimental reads from the real rice genome were aligned to the simulated rice genome using BWA. Alignment data were filtered (+, red striped and blue striped bars) or not filtered (–, light red and light blue bars) with the Coval-Refine component (error correction mode), and homozygous SNPs and indels were called using the indicated variant callers. The SNPs and indels extracted by all the callers were further filtered under the same conditions, as described in the text. True positive rate (TPR: the number of successfully called SNPs or indels divided by the number of SNPs or indels introduced into the simulated genome, multiplied by 100) is shown with light red and red striped bars, and false positive rate (FPR: the number of wrongly called SNPs or indels divided by the total number of called SNPs or indels, multiplied by 100) with light blue and blue striped bars. The GATK pipeline was run with (GATK BQSR) or without (GATK) base quality score recalibration. Variant quality score recalibration in the GATK pipeline was omitted because it was unsuitable for our data; it was replaced by simple filtering: a minimum allele frequency of 0.8 and a minimum allelic read depth of 2 (see Materials and Methods for details).
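The TPR and FPR definitions in this caption, together with the simple allele-frequency and depth filter, can be written out as a short Python sketch. This is an illustration only, not code from Coval or GATK: the function names, the tuple representation of a variant, and the reading of "allelic read depth" as the number of reads supporting the non-reference allele are all assumptions. Note that FPR as defined here is normalized by the number of calls made, not by the number of true negatives.

```python
def accuracy_rates(called, truth):
    """Return (TPR, FPR) in percent, following the caption's definitions.

    called: variants reported by a caller
    truth:  variants introduced into the simulated genome
    Variants are hashable tuples, e.g. (chrom, pos, ref, alt); this
    representation is an assumption made for the example.
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)    # successfully called variants
    false_pos = len(called - truth)   # wrongly called variants
    tpr = 100.0 * true_pos / len(truth) if truth else 0.0
    fpr = 100.0 * false_pos / len(called) if called else 0.0
    return tpr, fpr


def passes_simple_filter(alt_reads, total_reads, min_freq=0.8, min_alt_depth=2):
    """Simple filter standing in for variant quality score recalibration.

    Assumes allele frequency = alt_reads / total_reads and that the
    allelic read depth is the count of reads carrying the allele.
    """
    if total_reads <= 0:
        return False
    return alt_reads >= min_alt_depth and alt_reads / total_reads >= min_freq


# Made-up example data: four introduced SNPs, four calls, three correct.
truth = {
    ("chr10", 100, "A", "G"),
    ("chr10", 250, "C", "T"),
    ("chr10", 400, "G", "A"),
    ("chr10", 900, "T", "C"),
}
called = {
    ("chr10", 100, "A", "G"),
    ("chr10", 250, "C", "T"),
    ("chr10", 400, "G", "A"),
    ("chr10", 777, "T", "C"),
}
tpr, fpr = accuracy_rates(called, truth)
print(tpr, fpr)  # 75.0 25.0
```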
Figure 2. Improvement by Coval-Refine of SNP/indel calling accuracy of variant calling tools for mouse alignment data.
(A) SNP calling accuracy with or without Coval-Refine. (B) Indel calling accuracy with or without Coval-Refine. A simulated mouse genome was aligned with real mouse read data using BWA. The alignments were filtered (+, striped bars) or not filtered (–, plain bars) with Coval-Refine. Homozygous SNPs and indels were called with the indicated variant callers under the same conditions as in Figure 1.
Figure 3. Improvement of targeted alignment by Coval-Refine.
Rice whole-genome sequencing reads (63 million 75 bp paired-end reads) were aligned to chromosome 10 (A and B) or a 1 Mb region of chromosome 10 (C and D) of the simulated rice genome. Snapshot views of the alignments, corresponding to positions 1,338,000 to 1,342,538, are shown with (B and D) or without (A and C) the Coval-Refine tool (basic mode). The alignment views were obtained with the IGV 1.5 viewer. Shaded bars represent reads, and colored lines within the bars represent non-reference bases. Blue arrowheads indicate true positive SNPs that had been introduced into the rice genome using the Coval-Simulate tool.
Figure 4. Improvement of SNP/indel calling accuracy by Coval-Refine in targeted alignment.
All chromosomes (All chr), chromosome 10 (Chr10), or a 1 Mb fragment of chromosome 10 (Chr10-1M: positions 1,000,001 to 2,000,000 of Chr10) from the simulated rice genome were aligned with 75-bp paired-end reads sequenced from the whole rice genome using BWA. The alignments were filtered (+, bars in dark and middle red and in dark and middle blue) or not filtered (–, bars in light red and in light blue) with Coval-Refine in the basic mode. Two different Coval-Refine filtering conditions for mismatched reads were applied: one is the default option, which removes reads with three or more mismatches (middle-red and middle-blue bars); the other also removes the second paired-end mate when the first mate is filtered and removes any read pair containing more than two mismatches in total (dark red and dark blue bars). The mean read depth before and after (in parentheses) the Coval-Refine treatment is indicated under the reference chromosome name. Homozygous SNPs and indels were called as in Figure 1. TPR and FPR for the called SNPs are shown with red and blue bars, respectively.
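The two mismatch-filtering conditions described in this caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Coval-Refine's implementation: the `ReadPair` class, the `filter_reads` function, and the precomputed per-mate mismatch counts are hypothetical, and the paired mode reflects one reading of the caption (a pair is dropped if either mate fails on its own or if the pair's total exceeds the limit).

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical container: mismatch counts for the two mates of a pair."""
    name: str
    mismatches_1: int
    mismatches_2: int


def filter_reads(pairs, max_mismatches=2, paired=False):
    """Return (pair name, mate number) for every read that is kept.

    Default mode: a read is removed when it has three or more mismatches
    (more than max_mismatches).
    Paired mode (assumed reading of the caption): if one mate is removed,
    its partner is removed too, and a pair is also removed when the two
    mates together carry more than max_mismatches mismatches.
    """
    kept = []
    for p in pairs:
        keep1 = p.mismatches_1 <= max_mismatches
        keep2 = p.mismatches_2 <= max_mismatches
        if paired:
            total_ok = (p.mismatches_1 + p.mismatches_2) <= max_mismatches
            keep1 = keep2 = keep1 and keep2 and total_ok
        if keep1:
            kept.append((p.name, 1))
        if keep2:
            kept.append((p.name, 2))
    return kept


# Made-up example pairs.
pairs = [
    ReadPair("a", 0, 1),  # clean pair
    ReadPair("b", 3, 0),  # first mate has three mismatches
    ReadPair("c", 2, 2),  # each mate passes alone, but the pair totals four
]
print(filter_reads(pairs))
# [('a', 1), ('a', 2), ('b', 2), ('c', 1), ('c', 2)]
print(filter_reads(pairs, paired=True))
# [('a', 1), ('a', 2)]
```

With these thresholds the per-mate check in paired mode is implied by the total check; it is kept explicit so the two conditions from the caption remain visible.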

Grant support

This research was supported by the Program for Promotion of Basic Research Activities for Innovative Biosciences (PROBRAIN), the Ministry of Agriculture, Forestry, and Fisheries of Japan, and a Grant-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science and Technology, Japan to RT (Grant-in-Aid for Scientific Research on Innovative Areas 23113009), and by JSPS KAKENHI Grant Number 22510217. LC, DM, and S. Kamoun were supported by the Gatsby Charitable Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.