Indexing Arbitrary-Length k-Mers in Sequencing Reads

PLoS One. 2015 Jul 16;10(7):e0133198. doi: 10.1371/journal.pone.0133198. eCollection 2015.

Abstract

We propose a lightweight data structure for indexing and querying collections of NGS reads in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), is based on finding overlapping reads and is competitive with existing algorithms in space use, query time, or both. The main applications of our index include variant calling, error correction, and analysis of reads from RNA-seq experiments.
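
To make the counting and locating queries concrete, the following is a minimal sketch of a plain suffix array over concatenated reads. It is not PgSA itself: it omits the overlap-based pseudogenome construction and any space-saving techniques, and the function names (`build_index`, `count_occurrences`, `locate`) are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right


def build_index(reads):
    """Concatenate reads with a '$' separator and sort all suffix positions."""
    text = "$".join(reads) + "$"
    # Start position of each read in the concatenated text.
    starts = []
    pos = 0
    for read in reads:
        starts.append(pos)
        pos += len(read) + 1  # +1 for the separator
    # Naive construction that compares whole suffixes; toy inputs only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, starts, sa


def _interval(text, sa, kmer):
    """Return the half-open suffix array range whose suffixes start with kmer."""
    k = len(kmer)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose k-prefix is >= kmer
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < kmer:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose k-prefix is > kmer
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= kmer:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


def count_occurrences(index, kmer):
    """Number of occurrences of kmer across all reads."""
    text, _, sa = index
    left, right = _interval(text, sa, kmer)
    return right - left


def locate(index, kmer):
    """Sorted list of (read number, offset within read) for each occurrence."""
    text, starts, sa = index
    left, right = _interval(text, sa, kmer)
    hits = []
    for pos in sa[left:right]:
        read_no = bisect_right(starts, pos) - 1
        hits.append((read_no, pos - starts[read_no]))
    return sorted(hits)


index = build_index(["ACGTAC", "GTACGT", "TTACG"])
print(count_occurrences(index, "TAC"))
print(locate(index, "TAC"))
```

Because the separator never occurs in a k-mer, matches cannot span two reads. This sketch stores every read in full, whereas the abstract describes PgSA as exploiting overlaps between reads.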

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Animals
  • Caenorhabditis elegans / genetics
  • Datasets as Topic
  • Escherichia coli / genetics
  • Genome*
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Sequence Analysis, RNA / methods
  • Sequence Analysis, RNA / statistics & numerical data*
  • Software*

Grants and funding

This work was supported by The Polish National Science Centre under the project DEC-2012/05/B/ST6/03148. The infrastructure was supported by POIG.02.03.01-24-099/13 grant “GeCONiI - Upper Silesian Center for Computational Science and Engineering.” The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.