Quality score based identification and correction of pyrosequencing errors

PLoS One. 2013 Sep 5;8(9):e73015. doi: 10.1371/journal.pone.0073015. eCollection 2013.

Abstract

Massively-parallel DNA sequencing using the 454/pyrosequencing platform allows in-depth probing of diverse sequence populations, such as within an HIV-1 infected individual. Analysis of this sequence data, however, remains challenging due to the shorter read lengths relative to that obtained by Sanger sequencing as well as errors introduced during DNA template amplification and during pyrosequencing. The ability to distinguish real variation from pyrosequencing errors with high sensitivity and specificity is crucial to interpreting sequence data. We introduce a new algorithm, CorQ (Correction through Quality), which utilizes the inherent base quality in a sequence-specific context to correct for homopolymer and non-homopolymer insertion and deletion (indel) errors. CorQ also takes uneven read mapping into account for correcting pyrosequencing miscall errors and it identifies and corrects carry forward errors. We tested the ability of CorQ to correctly call SNPs on a set of pyrosequences derived from ten viral genomes from an HIV-1 infected individual, as well as on six simulated pyrosequencing datasets generated using non-zero error rates to emulate errors introduced by PCR. When combined with the AmpliconNoise error correction method developed to remove ambiguities in signal intensities, we attained a 97% reduction in indel errors, a 98% reduction in carry forward errors, and >97% specificity of SNP detection. When compared to four other error correction methods, AmpliconNoise+CorQ performed at equal or higher SNP identification specificity, but the sensitivity of SNP detection was consistently higher (>98%) than other methods tested. This combined procedure will therefore permit examination of complex genetic populations with improved accuracy.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Computational Biology / methods
  • Genome, Viral
  • HIV-1 / genetics
  • High-Throughput Nucleotide Sequencing / methods
  • High-Throughput Nucleotide Sequencing / standards*
  • Humans
  • Polymorphism, Single Nucleotide
  • Quality Control
  • Sensitivity and Specificity
  • Sequence Analysis, DNA / methods