Mining statistically-solid k-mers for accurate NGS error correction
- PMID: 30598110
- PMCID: PMC6311904
- DOI: 10.1186/s12864-018-5272-y
Abstract
Background: NGS data contains many machine-induced errors. The most advanced error correction methods depend heavily on the selection of solid k-mers. A solid k-mer is a k-mer that occurs frequently in NGS reads; all other k-mers are called weak k-mers. A solid k-mer is unlikely to contain errors, while a weak k-mer most likely does. An intensively investigated problem is to find a good frequency cutoff f0 to balance the numbers of solid and weak k-mers. Once the cutoff is determined, a more challenging but less-studied problem is to: (i) remove a small subset of solid k-mers that are likely to contain errors, and (ii) add a small subset of weak k-mers that are likely to contain no errors to the remaining set of solid k-mers. Identifying these two subsets of k-mers can improve correction performance.
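The solid/weak split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `count_kmers` and `split_by_cutoff`, the toy reads, and the choice to treat a frequency equal to f0 as solid are all assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_cutoff(counts, f0):
    """Partition k-mers into solid (frequency >= f0) and weak (the rest).

    Treating the boundary value f0 itself as solid is an assumption here.
    """
    solid = {kmer for kmer, c in counts.items() if c >= f0}
    weak = set(counts) - solid
    return solid, weak


# Toy input: the third read differs from the first two at one base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
solid, weak = split_by_cutoff(counts, f0=2)
```

In this toy input, the k-mers that span the differing base in the third read occur only once and fall into the weak set, which is the behaviour a frequency cutoff is meant to produce.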
Results: We propose to use a Gamma distribution to model the frequencies of erroneous k-mers and a mixture of Gaussian distributions to model the frequencies of correct k-mers, and combine the two to determine f0. To identify the two special subsets of k-mers, we use the z-score of each k-mer, which measures how many standard deviations the k-mer's frequency lies from the mean. These statistically-solid k-mers are then used to construct a Bloom filter for error correction. Tested on both real and synthetic NGS data sets, our method is markedly superior to the state-of-the-art methods.
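As a rough illustration of the z-score mentioned above, the sketch below computes how many standard deviations a frequency lies from the mean of a reference set of frequencies. The abstract does not say which frequencies form that reference set, so `reference_freqs` is a hypothetical input, and the use of the population standard deviation and the zero-variance fallback are assumptions.

```python
import statistics


def z_score(freq, reference_freqs):
    """Number of standard deviations `freq` lies from the reference mean.

    `reference_freqs` is a hypothetical comparison set; the abstract does
    not specify how the mean and standard deviation are obtained.
    """
    mu = statistics.mean(reference_freqs)
    sigma = statistics.pstdev(reference_freqs)  # population std (assumption)
    if sigma == 0:
        return 0.0  # all reference values identical; no spread to measure
    return (freq - mu) / sigma


# Toy reference frequencies with mean 5 and population std 2.
reference = [2, 4, 4, 4, 5, 5, 7, 9]
high = z_score(9, reference)
low = z_score(1, reference)
```

A k-mer whose frequency sits far below the reference mean gets a strongly negative z-score, and one far above gets a strongly positive score; how such scores are thresholded to move k-mers between the solid and weak sets is not stated in the abstract.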
Conclusion: The z-score is adequate to distinguish solid k-mers from weak k-mers and is particularly useful for pinpointing solid k-mers that have very low frequency. Applying the z-score to k-mers can markedly improve error correction accuracy.
Keywords: Error correction; Next-generation sequencing; z-score.
Conflict of interest statement
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
