Simrank: Rapid and sensitive general-purpose k-mer search tool

Todd Z DeSantis; Keith Keller; Ulas Karaoz; Alexander V Alekseyenko; Navjeet N S Singh; Eoin L Brodie; Zhiheng Pei; Gary L Andersen; Niels Larsen

doi:10.1186/1472-6785-11-11

BMC Ecol. 2011 Apr 27:11:11. doi: 10.1186/1472-6785-11-11.

Authors

Todd Z DeSantis¹, Keith Keller, Ulas Karaoz, Alexander V Alekseyenko, Navjeet N S Singh, Eoin L Brodie, Zhiheng Pei, Gary L Andersen, Niels Larsen

Affiliation

¹ Ecology Department, Lawrence Berkeley National Laboratory, Berkeley, USA. tdesantis@lbl.gov

Abstract

Background: Terabyte-scale collections of string-encoded data are expected from consortium efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Rapid k-mer matching strategies enable similarity searches both within and between projects. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as subroutines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available.

Results: Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language datasets found Simrank to be 10X to 928X faster, depending on the dataset.

Conclusions: Simrank provides molecular ecologists with a high-throughput, open-source option for comparing large sequence sets to find similarity.
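
The general idea of k-mer matching referred to in the abstract can be illustrated with a short sketch. This is not Simrank's implementation: the function names, the choice of k = 3, the toy sequences, and the scoring rule (fraction of the query's distinct k-mers found in each database string) are all assumptions made for illustration only.

```python
def kmers(s, k):
    """Return the set of distinct length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def rank_by_shared_kmers(query, database, k=3):
    """Rank database strings by the fraction of the query's distinct
    k-mers that they contain.

    Hypothetical scoring rule chosen for illustration; the abstract
    does not specify how Simrank computes similarity.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        # Query shorter than k: nothing to compare.
        return []
    scored = []
    for name, seq in database.items():
        shared = len(query_kmers & kmers(seq, k))
        scored.append((name, shared / len(query_kmers)))
    # Highest similarity first; ties broken by name for stable output.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# Toy database of made-up sequences (not real data).
database = {
    "seq_a": "ACGTACGTGA",
    "seq_b": "TTTTGGGGCC",
    "seq_c": "ACGTTTGACA",
}
ranking = rank_by_shared_kmers("ACGTACGA", database, k=3)
for name, score in ranking:
    print(name, round(score, 2))
```

With these toy inputs, seq_a shares the most 3-mers with the query and is ranked first, while seq_b shares none. A practical tool would typically precompute an index from k-mers to database entries rather than rescanning every string per query, which this sketch omits for brevity.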

Publication types

Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Computational Biology
DNA
Databases, Bibliographic*
Databases, Factual*
Molecular Biology*
Proteins
RNA
Software*

Substances

Proteins
RNA
DNA
