Efficient motif finding algorithms for large-alphabet inputs

BMC Bioinformatics. 2010 Oct 26;11(Suppl 8):S1. doi: 10.1186/1471-2105-11-S8-S1.

Abstract

Background: We consider the problem of identifying motifs (recurring or conserved patterns) in biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings.
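To make the problem statement concrete, the following is a minimal sketch of a naive exhaustive search, not the algorithm proposed in the paper. It assumes that "inexact" means at most d mismatches (Hamming distance) and that "most" is expressed as a quorum of input strings; the names find_motifs, occurs, hamming, and quorum are hypothetical and chosen only for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(pattern, seq, d):
    """True if some window of seq matches pattern with at most d mismatches."""
    l = len(pattern)
    return any(
        hamming(pattern, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def find_motifs(seqs, l, d, quorum, alphabet="ACGT"):
    """Return every length-l pattern found (within d mismatches) in at least
    `quorum` of the input strings.

    Assumption: this enumerates all |alphabet|**l candidates, so the cost
    grows exponentially in l and polynomially in the alphabet size. It is a
    baseline for illustration only.
    """
    hits = []
    for cand in product(alphabet, repeat=l):
        pattern = "".join(cand)
        support = sum(occurs(pattern, s, d) for s in seqs)
        if support >= quorum:
            hits.append(pattern)
    return hits


if __name__ == "__main__":
    # Toy input: "ACGT" is planted exactly in two strings and with one
    # mismatch ("ACCT") in the third.
    seqs = ["TTACGTTT", "GGACGTGG", "CCACCTCC"]
    print(find_motifs(seqs, l=4, d=1, quorum=3))
```

Because the candidate space here is |alphabet|^l, this baseline becomes impractical for long motifs or protein-sized alphabets, which is the regime the abstract identifies as difficult.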

Results: The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of the alphabet. On a synthetic planted DNA motif finding problem, our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA), we observed a reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs for Lipocalin and Zinc metallopeptidase, and supersecondary structure motifs for the Cadherin and Immunoglobulin families.

Conclusions: Our algorithm reduces the computational complexity of current motif finding algorithms and demonstrates strong running time improvements over existing exact algorithms, especially in the important and difficult case of large-alphabet sequences.

MeSH terms

  • Algorithms*
  • Amino Acid Motifs
  • Artificial Intelligence
  • Binding Sites*
  • Computational Biology / methods*
  • DNA / chemistry
  • Databases, Genetic
  • Nucleic Acid Conformation
  • Pattern Recognition, Automated / methods*
  • Protein Conformation
  • Sequence Analysis, DNA / methods*
  • Sequence Analysis, Protein / methods*
  • Software
  • Transcription Factors / chemistry

Substances

  • Transcription Factors
  • DNA