BMC Genomics. 2018 Dec 31;19(Suppl 10):912.
doi: 10.1186/s12864-018-5272-y.

Mining statistically-solid k-mers for accurate NGS error correction

Free PMC article

Liang Zhao et al. BMC Genomics.

Abstract

Background: NGS data contain many machine-induced errors. The most advanced error correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely does. An intensively investigated problem is finding a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance.
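As a rough illustration of the cutoff-based split described in the Background, the sketch below counts k-mers in a handful of reads and partitions them at a cutoff f0. The function names `count_kmers` and `split_by_cutoff`, the toy reads, and the choice to treat a frequency equal to f0 as solid are assumptions made for illustration, not details taken from the article.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def split_by_cutoff(counts, f0):
    """Partition k-mers into solid (frequency >= f0) and weak (the rest).

    Treating frequency == f0 as solid is an assumption of this sketch.
    """
    solid = {kmer for kmer, freq in counts.items() if freq >= f0}
    weak = set(counts) - solid
    return solid, weak

# Toy input: the second read differs from the others in its last base.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
counts = count_kmers(reads, k=4)
solid, weak = split_by_cutoff(counts, f0=3)
```

With these toy reads, the only k-mer that falls below the cutoff is the one covering the differing base, which is the behaviour a frequency cutoff is meant to capture.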

Results: We propose using a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model the frequencies of correct k-mers, and we combine the two to determine f0. To identify the two special subsets of k-mers, we use the z-score of each k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Tested on both real and synthetic NGS data sets, our method is markedly superior to state-of-the-art methods.
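The z-score refinement can be sketched as follows. The abstract does not say which population the mean and standard deviation are taken over, so this sketch assumes, purely for illustration, that a k-mer is compared against itself plus its observed Hamming-distance-1 neighbours. The function names and the thresholds `z_promote` and `z_demote` are likewise hypothetical.

```python
from statistics import mean, pstdev

def hamming_neighbors(kmer, alphabet="ACGT"):
    """Yield every k-mer that differs from `kmer` at exactly one position."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]

def z_score(kmer, counts):
    """z = (f - mean) / std over the k-mer and its observed neighbours.

    The choice of reference group is an assumption of this sketch.
    """
    group = [counts[kmer]]
    group += [counts[n] for n in hamming_neighbors(kmer) if n in counts]
    sigma = pstdev(group)
    if sigma == 0:
        return 0.0  # no spread in the group, so no evidence either way
    return (counts[kmer] - mean(group)) / sigma

def refine(counts, f0, z_promote, z_demote):
    """Apply the cutoff, then move k-mers between sets by z-score."""
    solid = set()
    for kmer, freq in counts.items():
        z = z_score(kmer, counts)
        if freq >= f0:
            if z >= z_demote:      # solid k-mers with a low z-score are dropped
                solid.add(kmer)
        elif z >= z_promote:       # weak k-mers with a high z-score are added
            solid.add(kmer)
    return solid

# Hypothetical frequencies chosen to show one demotion and one promotion.
counts = {"CCCA": 40, "CCCC": 3, "TTTT": 2, "TTTA": 1, "TTTC": 1, "TTTG": 1}
refined = refine(counts, f0=3, z_promote=1.5, z_demote=-0.9)
```

In this toy input, "CCCC" passes the cutoff but is excluded because it is far less frequent than its neighbour "CCCA", while "TTTT" falls below the cutoff but is admitted because it stands out from its neighbours.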

Conclusion: The z-score is adequate for distinguishing solid k-mers from weak k-mers and is particularly useful for pinpointing solid k-mers with very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.

Keywords: Error correction; Next-generation sequencing; z-score.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Frequency distribution of both error-free and error-containing k-mers for an NGS data set. The frequency distribution of erroneous k-mers is represented by the dashed orange line, while the distribution of the correct ones is shown as the dashed sky-blue line. The solid black line is the distribution of all the k-mers. The α-labeled area is the proportion of correct k-mers having frequency less than f0, while the β-labeled area is the proportion of erroneous k-mers having frequency greater than f0.
Fig. 2
Illustration of the forward and backward search to correct sequencing errors. The forward search proceeds from the first k-mer to the last k-mer. At each step, the last base of the k-mer is substituted by its alternatives to check solidity. Conversely, the backward search proceeds from the last k-mer to the first k-mer. In contrast to the forward search, the first base of each k-mer is altered rather than the last one.
Fig. 3
A relation between k-mer frequency and GC-content. The bottom left panel shows the smoothed scatter plot of k-mer frequency against GC-content, the top left panel is the distribution of k-mer frequency, and the bottom right panel is the distribution of GC-content. It is clear that GC-content k-mers have relatively low frequency. The data shown in this example were obtained from the H. chromosome 14 with a k-mer size of 25.
Fig. 4
A relation between z-score and k-mer frequency. The level of shade represents the density of the distribution: the darker the color, the more k-mers are present. The frequencies of the k-mers highlighted in the red box are less than nine, so they are very likely to be treated as weak by all existing k-mer-based approaches. However, their very high z-scores indicate that they should be treated as solid k-mers. The data shown here were obtained from B. impatiens with a k-mer size of 25.
Fig. 5
The proportion of k-mers refined by z-score. The refinements are of two kinds: weak k-mers having a high z-score (moved to the solid k-mer set), and solid k-mers having a low z-score (excluded from the solid k-mer set).
Fig. 6
Memory saving analysis on the six data sets. The x-axis shows the memory saving ratio between the size of real memory allocation and the raw input, while the y-axis shows the proportion of the input held by a bit vector.
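The forward search described in the Fig. 2 caption can be sketched roughly as below. This is a simplified illustration under stated assumptions: a plain Python set stands in for the Bloom filter, a substitution is applied only when exactly one alternative base yields a solid k-mer, and the function name `forward_correct` and the toy sequences are hypothetical. The backward search would mirror it, walking from the last k-mer to the first and altering the first base of each k-mer.

```python
def forward_correct(read, solid, k, alphabet="ACGT"):
    """Walk k-mers left to right; repair the last base of any non-solid k-mer.

    `solid` is any container supporting `in` (a set here, standing in for
    a Bloom filter). Requiring a unique candidate is an assumption.
    """
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if kmer in solid:
            continue
        prefix = kmer[:-1]
        candidates = [b for b in alphabet
                      if b != kmer[-1] and prefix + b in solid]
        if len(candidates) == 1:
            # Later k-mers see the corrected base because `bases` is updated.
            bases[i + k - 1] = candidates[0]
    return "".join(bases)

# Toy example: the solid set holds every 4-mer of "ACGTTGCA".
solid = {"ACGT", "CGTT", "GTTG", "TTGC", "TGCA"}
corrected = forward_correct("ACGTTGGA", solid, k=4)
```

Because the forward pass only alters the last base of each k-mer, it cannot repair an error within the first k-1 bases of a read, which is one reason a backward pass is useful.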
