Data structures based on k-mers for querying large collections of sequencing data sets

Camille Marchet; Christina Boucher; Simon J Puglisi; Paul Medvedev; Mikaël Salson; Rayan Chikhi

doi:10.1101/gr.260604.119

Genome Res. 2021 Jan;31(1):1-12. doi: 10.1101/gr.260604.119. Epub 2020 Dec 16.

Authors

Camille Marchet¹, Christina Boucher², Simon J Puglisi³, Paul Medvedev⁴ ⁵ ⁶, Mikaël Salson¹, Rayan Chikhi⁷

Affiliations

¹ Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France.
² Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA.
³ Department of Computer Science, University of Helsinki, FI-00014, Helsinki, Finland.
⁴ Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
⁵ Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
⁶ Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
⁷ Institut Pasteur & CNRS, C3BI USR 3756, F-75015 Paris, France.

Abstract

High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
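
To make the idea of representing data sets as sets of k-mers concrete, here is a minimal Python sketch. It is an illustration only, not one of the surveyed methods: the function names (`kmers`, `build_index`, `query`), the toy reads, the choice of k = 4, and the 0.8 reporting threshold are all assumptions introduced for this example. It uses a plain in-memory inverted index and ignores reverse complements, sequencing errors, and the compression that real tools need at scale.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(datasets, k):
    """Map each k-mer to the set of data set names that contain it.

    `datasets` is a hypothetical dict: name -> list of read strings.
    """
    index = defaultdict(set)
    for name, reads in datasets.items():
        for read in reads:
            for kmer in kmers(read, k):
                index[kmer].add(name)
    return index


def query(index, seq, k, threshold=0.8):
    """Return {name: fraction} for data sets that contain at least
    `threshold` of the query's distinct k-mers (threshold is an assumption)."""
    query_kmers = kmers(seq, k)
    if not query_kmers:
        return {}
    hits = defaultdict(int)
    for kmer in query_kmers:
        # .get avoids creating empty entries in the defaultdict index
        for name in index.get(kmer, ()):
            hits[name] += 1
    total = len(query_kmers)
    return {name: count / total
            for name, count in hits.items()
            if count / total >= threshold}


# Toy collection of two "data sets" (made-up reads).
datasets = {
    "sample_A": ["ACGTACGTGA", "TTGACCA"],
    "sample_B": ["GGGTTTAAAC"],
}
index = build_index(datasets, k=4)
result = query(index, "ACGTACG", k=4)
print(result)
```

In this toy run, every 4-mer of the query occurs in `sample_A` and none occurs in `sample_B`, so only `sample_A` is reported. Reporting by fraction of shared k-mers, rather than requiring an exact match, is one simple way to tolerate small differences between the query and the indexed reads.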

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Review

MeSH terms

Algorithms*
High-Throughput Nucleotide Sequencing
Reproducibility of Results
Software*

Grants and funding

R01 AI141810/AI/NIAID NIH HHS/United States