Comparative Study. Genome Res 8 (3), 251-9

Estimation of Errors in "Raw" DNA Sequences: A Validation Study

P Richterich. Genome Res.


As DNA sequencing is performed more and more in a mass-production-like manner, efficient quality control measures become increasingly important for process control, as does the ability to compare different methods and projects. One of the fundamental quality measures in sequencing projects is the position-specific error probability at all bases in each individual sequence. Accurate prediction of base-specific error rates from "raw" sequence data would allow immediate quality control as well as benchmarking of different methods and projects, while avoiding the inefficiencies and time delays associated with resequencing and with assessments made after "finishing" a sequence. The program PHRED provides base-specific quality scores that are logarithmically related to error probabilities. This study assessed the accuracy of PHRED's error-rate prediction by analyzing sequencing projects from six different large-scale sequencing laboratories. All projects used four-color fluorescent sequencing, but the sequencing methods varied widely between projects. The results indicate that error-rate predictions such as those given by PHRED can be highly accurate for a large variety of sequencing methods as well as over a wide range of sequence quality.
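The logarithmic relationship mentioned in the abstract is conventionally written as Q = -10 · log10(P), where Q is the quality score and P is the estimated probability that the base call is wrong. The following is a minimal sketch of that conversion, not code from the study; the function names are hypothetical and chosen only for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-style quality score to an error probability.

    Uses the conventional definition Q = -10 * log10(P).
    """
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred-style score."""
    return -10 * math.log10(p)


def expected_errors(quals) -> float:
    """Predicted number of errors in a read: the sum of per-base error probabilities."""
    return sum(phred_to_error_prob(q) for q in quals)
```

Under this definition, a score of 30 corresponds to a predicted error rate of 0.1%, which is the threshold the Figure 2 caption uses for "very high-quality bases".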


Figure 1
Actual and predicted error rates in six different sequencing projects. Actual error rates and predicted error rates in 50-base windows over the length of the sequence reads, averaged over all reads that could be aligned to the consensus sequence by CROSS_MATCH, are shown. The numbers on the x-axis show the first base in a given 50-base window.
Figure 2
Actual and predicted error rates in different quality subsets of project B. Sequence reads were sorted by the number of bases with a predicted error rate of at most 0.1% (very high-quality bases), and assigned to quartiles, with quartile 1 corresponding to the highest numbers. Actual and predicted error rates for all sequences in each subset were calculated as in Fig. 1. Note that a number of sequence reads that had been rejected because of too low quality were added back to the data set for illustrative purposes, all of which are in quartile 4. These sequences were not included in the data sets used to generate Figs. 1 and 3 and Tables 1 and 3.
Figure 3
Actual frameshift and total error rates for projects A and B. To calculate frameshift error rates, only insertions and deletions were counted. Mismatch errors, which account for the vast majority of errors after base 150, were included only in the total error count. Note that project B (▴,▵) has a similar or slightly higher total error rate compared to project A (•,○) but only about one-third as many insertions and deletions up to base 500. For both projects, the frameshift error rate in the raw data is <1 in 1000 for >300 bases, and ≤1 in 10,000 for >100 bases in project B.
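The comparisons described in the captions for Figs. 1 and 2 could be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the study's code: it assumes each read is already available as a pair of per-base quality scores and per-base error flags (the study derived errors from CROSS_MATCH alignments, which is not reproduced here), it assumes non-overlapping 50-base windows, and the function names `windowed_rates` and `quality_quartiles` are hypothetical.

```python
from collections import defaultdict


def windowed_rates(reads, window=50):
    """Average predicted and actual error rates per window over all reads.

    reads: iterable of (quals, errors) pairs, where quals is a list of
    Phred-style scores and errors is a same-length list of booleans.
    Returns {first_base (1-based): (predicted_rate, actual_rate)}.
    Assumption: windows do not overlap.
    """
    predicted = defaultdict(float)
    observed = defaultdict(int)
    bases = defaultdict(int)
    for quals, errors in reads:
        for i, (q, is_error) in enumerate(zip(quals, errors)):
            first_base = (i // window) * window + 1
            predicted[first_base] += 10 ** (-q / 10)  # Q = -10 * log10(P)
            observed[first_base] += int(is_error)
            bases[first_base] += 1
    return {
        w: (predicted[w] / bases[w], observed[w] / bases[w])
        for w in sorted(bases)
    }


def quality_quartiles(reads, min_q=30):
    """Assign each read to a quartile (1 = most very high-quality bases).

    A predicted error rate of at most 0.1% corresponds to a score >= 30.
    Returns a list of quartile numbers in the same order as reads.
    """
    counts = [sum(q >= min_q for q in quals) for quals, _ in reads]
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    quartile = [0] * len(counts)
    for rank, idx in enumerate(order):
        quartile[idx] = rank * 4 // len(order) + 1
    return quartile
```

Calling `windowed_rates` separately on the reads of each quartile would give per-subset curves of the kind the Fig. 2 caption describes.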
