Multiple seeds sensitivity using a single seed with threshold

J Bioinform Comput Biol. 2015 Aug;13(4):1550011. doi: 10.1142/S0219720015500110. Epub 2015 Feb 3.

Abstract

Spaced seeds are a fundamental tool for similarity search in biosequences. The best sensitivity/selectivity trade-offs are obtained using many seeds simultaneously: This is known as the multiple seed approach. Unfortunately, spaced seeds use a large amount of memory and the available RAM is a practical limit to the number of seeds one can use simultaneously. Inspired by some recent results on lossless seeds, we revisit the approach of using a single spaced seed and considering two regions homologous if the seed hits in at least t sufficiently close positions. We show that by choosing the locations of the don't care symbols in the seed using quadratic residues modulo a prime number, we derive single seeds that when used with a threshold t > 1 have competitive sensitivity/selectivity trade-offs, indeed close to the best multiple seeds known in the literature. In addition, the choice of the threshold t can be adjusted to modify sensitivity and selectivity a posteriori, thus enabling a more accurate search in the specific instance at issue. The seeds we propose also exhibit robustness and allow flexibility in usage.

Keywords: Homology search; sensitivity/selectivity tradeoffs; spaced seeds.

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Sequence Homology*
  • Software