Set-Min Sketch: A Probabilistic Map for Power-Law Distributions with Application to k-Mer Annotation

J Comput Biol. 2022 Feb;29(2):140-154. doi: 10.1089/cmb.2021.0429. Epub 2022 Jan 18.

Abstract

k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, producing in output very large count tables explicitly representing k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to only keep counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce Set-Min sketch-a sketching technique for representing associative maps inspired from Count-Min-and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min-an improved variant of Count-Min for static datasets that we define here. We show that Set-Min sketch provides a very low error rate, in terms of both the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. Space-efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.

Keywords: k-mer counting; k-mer spectrum; max-min sketch; power-law distribution; set-min sketch; sketching.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Computational Biology / methods*
  • Computer Graphics
  • Databases, Genetic / statistics & numerical data
  • Genome, Human
  • Genomics / statistics & numerical data*
  • Humans
  • Models, Statistical
  • Molecular Sequence Annotation / statistics & numerical data
  • Software*