Data mining tools for biological sequences

J Bioinform Comput Biol. 2003 Apr;1(1):139-67. doi: 10.1142/s0219720003000216.


We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
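
Steps (a) and (b) of the methodology can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`kgram_counts`, `atg_windows`, `region_features`, `t_statistic`, `rank_features`), the use of fixed-size windows on each side of an ATG, and the Welch-style form of the t-statistic are all assumptions made for illustration; the abstract does not specify these details. Step (c) is omitted: the top-ranked features would be passed to a classifier such as C4.5, SVM, or Naive Bayes.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def kgram_counts(seq, k):
    """Count overlapping k-grams (substrings of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def atg_windows(seq, flank):
    """Yield (position, upstream, downstream) for every ATG that has
    at least `flank` bases on both sides (window size is an assumption)."""
    pos = seq.find("ATG")
    while pos != -1:
        if pos >= flank and pos + 3 + flank <= len(seq):
            yield pos, seq[pos - flank:pos], seq[pos + 3:pos + 3 + flank]
        pos = seq.find("ATG", pos + 1)


def region_features(upstream, downstream, k):
    """Step (a): k-gram counts, kept separate for each side of the ATG."""
    feats = {}
    for side, region in (("up", upstream), ("down", downstream)):
        for gram, count in kgram_counts(region, k).items():
            feats[f"{side}_{gram}"] = count
    return feats


def t_statistic(pos_vals, neg_vals):
    """Welch-style t-statistic; each class needs at least two values."""
    diff = abs(mean(pos_vals) - mean(neg_vals))
    denom = sqrt(variance(pos_vals) / len(pos_vals)
                 + variance(neg_vals) / len(neg_vals))
    if denom == 0:
        # No within-class spread: perfectly separating or uninformative.
        return float("inf") if diff > 0 else 0.0
    return diff / denom


def rank_features(pos_samples, neg_samples):
    """Step (b): score every feature and sort by descending score."""
    names = set().union(*pos_samples, *neg_samples)
    scores = {
        name: t_statistic([s.get(name, 0) for s in pos_samples],
                          [s.get(name, 0) for s in neg_samples])
        for name in names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In use, each ATG found by `atg_windows` would be turned into a feature dictionary with `region_features`, the dictionaries would be split into true initiation sites and other ATG sites according to known annotations, and `rank_features` would then order the k-gram features by how well they separate the two groups.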

Publication types

  • Review

MeSH terms

  • Artificial Intelligence
  • Base Sequence
  • Computational Biology*
  • Databases, Nucleic Acid
  • Humans
  • Peptide Chain Initiation, Translational
  • Protein Biosynthesis*
  • RNA, Messenger / genetics
  • Sequence Analysis, RNA / statistics & numerical data*


Substances

  • RNA, Messenger