The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Antonio Blanca; Robert S Harris; David Koslicki; Paul Medvedev

doi:10.1089/cmb.2021.0431

J Comput Biol. 2022 Feb;29(2):155-168. doi: 10.1089/cmb.2021.0431. Epub 2022 Feb 1.

Authors

Antonio Blanca¹, Robert S Harris², David Koslicki¹,²,³, Paul Medvedev¹,³,⁴

Affiliations

¹ Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.
² Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.
³ Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA.
⁴ Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.

PMID: 35108101
DOI: 10.1089/cmb.2021.0431

Abstract

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider a model in which a sequence S (e.g., a genome or a read) undergoes a simple mutation process: each nucleotide is mutated independently with probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers, of the number of islands (maximal intervals of mutated k-mers), and of the number of oceans (maximal intervals of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results in three applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
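The mutation process described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes that a k-mer counts as mutated whenever at least one of its k nucleotides is mutated, which is how we read the no-spurious-matches assumption. Under that reading, linearity of expectation gives an expected number of mutated k-mers of L(1 - (1 - r)^k) for a sequence with L k-mers. The function names and parameters (`simulate_kmer_mutations`, `expected_mutated_kmers`, `num_kmers`) are hypothetical and chosen for illustration only. The variance, hypothesis tests, and CIs derived in the paper are not reproduced here.

```python
import random


def simulate_kmer_mutations(num_kmers, k, r, rng):
    """Simulate one run of the mutation process (illustrative sketch).

    Returns (n_mutated, n_islands, n_oceans) for a sequence that has
    num_kmers k-mers, i.e. num_kmers + k - 1 nucleotides.
    Assumption: a k-mer is mutated iff any of its k positions is mutated.
    """
    n_positions = num_kmers + k - 1
    # Each nucleotide is mutated independently with probability r.
    mutated_pos = [rng.random() < r for _ in range(n_positions)]

    # Sliding window: number of mutated positions inside the current k-mer.
    window = sum(mutated_pos[:k])
    kmer_mutated = [window > 0]
    for i in range(1, num_kmers):
        window += mutated_pos[i + k - 1] - mutated_pos[i - 1]
        kmer_mutated.append(window > 0)

    # Islands are maximal runs of mutated k-mers,
    # oceans are maximal runs of nonmutated k-mers.
    n_islands = n_oceans = 0
    previous = None
    for is_mutated in kmer_mutated:
        if is_mutated != previous:
            if is_mutated:
                n_islands += 1
            else:
                n_oceans += 1
        previous = is_mutated

    return sum(kmer_mutated), n_islands, n_oceans


def expected_mutated_kmers(num_kmers, k, r):
    """Expected number of mutated k-mers: L * (1 - (1 - r)^k)."""
    return num_kmers * (1.0 - (1.0 - r) ** k)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    L, k, r = 1000, 21, 0.05
    trials = [simulate_kmer_mutations(L, k, r, rng) for _ in range(300)]
    mean_mutated = sum(t[0] for t in trials) / len(trials)
    print("simulated mean of mutated k-mers:", mean_mutated)
    print("expected number of mutated k-mers:", expected_mutated_kmers(L, k, r))
```

The simulation tracks only which positions are mutated, not the nucleotides themselves, because under the stated assumption the k-mer statistics depend only on the positions. Comparing the simulated mean with the closed-form expectation is a quick sanity check of the model before moving on to the interval estimates for r.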

Keywords: Jaccard similarity; MinHash; confidence intervals; k-mers; mutation process; sketching.

Publication types

Evaluation Study
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Base Sequence
Computational Biology
Confidence Intervals
Genomics / statistics & numerical data
Humans
Models, Genetic
Mutation*
Sequence Alignment / statistics & numerical data
Sequence Analysis, DNA / statistics & numerical data*
Software