TrieAMD: a scalable and efficient apriori motif discovery approach

Int J Data Min Bioinform. 2015;13(1):13-30. doi: 10.1504/ijdmb.2015.070833.

Abstract

Motif discovery is the problem of finding recurring patterns in biological sequences. It is one of the hardest and longest-standing problems in bioinformatics. Apriori is a well-known data-mining algorithm for discovering frequent patterns in large datasets. In this paper, we apply the Apriori algorithm to motif discovery, using the Trie data structure, and propose several modifications that adapt the classic Apriori to this problem. Experiments are conducted on Tompa's benchmark to investigate the performance of the proposed algorithm, Trie-based Apriori Motif Discovery (TrieAMD). Results show that our algorithm outperforms all of the tested tools on real datasets in average sensitivity, which means that our approach is able to discover more motifs. In terms of specificity, the performance of our algorithm is comparable to that of the other tools. The results also confirm that the algorithm scales linearly in both time and space.
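
The general idea of combining Apriori with a trie can be illustrated with a minimal sketch. It is not the TrieAMD algorithm itself: the abstract does not specify the authors' modifications, so the sketch assumes exact substring matching, defines support as the number of sequences containing a pattern, and uses hypothetical names (`TrieNode`, `frequent_patterns`, `min_support`, `max_len`) chosen for illustration only. Candidates of length k+1 are built by extending frequent patterns of length k, and a candidate is pruned when its length-k suffix is not already stored in the trie as frequent (the Apriori property).

```python
class TrieNode:
    """One trie node; the path from the root spells a pattern."""

    def __init__(self):
        self.children = {}
        self.is_frequent = False  # True if the pattern ending here met min_support
        self.support = 0          # number of sequences containing that pattern


def insert(root, pattern, support):
    """Store a frequent pattern and its support in the trie."""
    node = root
    for ch in pattern:
        node = node.children.setdefault(ch, TrieNode())
    node.is_frequent = True
    node.support = support


def is_frequent(root, pattern):
    """Return True if the pattern was previously stored as frequent."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_frequent


def count_support(pattern, sequences):
    """Assumed support measure: sequences containing the exact pattern."""
    return sum(1 for seq in sequences if pattern in seq)


def frequent_patterns(sequences, min_support, max_len):
    """Level-wise (Apriori-style) search for frequent substrings."""
    alphabet = sorted(set("".join(sequences)))
    root = TrieNode()
    found = {}
    frontier = [""]  # frequent patterns of the previous length
    for length in range(1, max_len + 1):
        next_frontier = []
        for prefix in frontier:
            for ch in alphabet:
                candidate = prefix + ch
                # Apriori pruning: the suffix of length (length - 1) must
                # already be frequent, otherwise the candidate cannot be.
                if length > 1 and not is_frequent(root, candidate[1:]):
                    continue
                support = count_support(candidate, sequences)
                if support >= min_support:
                    insert(root, candidate, support)
                    found[candidate] = support
                    next_frontier.append(candidate)
        if not next_frontier:
            break
        frontier = next_frontier
    return found
```

For example, calling `frequent_patterns(["ACGTAC", "TACGGA", "GGACGT"], min_support=3, max_len=3)` keeps only the patterns present in all three toy sequences, with `ACG` as the longest. Real motif discovery normally has to tolerate mismatches between motif instances, which this exact-match sketch does not attempt.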

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Amino Acid Motifs
  • Data Mining / methods*
  • Databases, Protein*
  • Proteins / chemistry
  • Proteins / genetics*
  • Sequence Analysis, Protein / methods*
  • Software*

Substances

  • Proteins