Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA

Nat Comput Sci. 2024 Feb;4(2):104-109. doi: 10.1038/s43588-024-00596-6. Epub 2024 Feb 26.

Abstract

Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.

MeSH terms

  • Databases, Nucleic Acid
  • Genomics*
  • Metagenome / genetics
  • Oceans and Seas
  • Seawater*