Variable-length positional modeling for biological sequence classification

AMIA Annu Symp Proc. 2008 Nov 6:2008:91-5.

Abstract

Selecting the most informative features in supervised biological classification problems is a decisive pre-processing step for two main reasons: (1) to reduce the dimensionality of the problem, and (2) to ascribe biological meaning to the underlying feature interactions. This paper presents a filter-based feature selection method suited to positional modeling of biological sequences. The motivation is that a positional model of fixed length describes biological sequences sub-optimally in a given classification problem. The core filtering criterion is the F-score, and the source features are the positional probabilities describing variable-length interactions among residues. The proposed method was evaluated on human splice site classification using a linear SVM classifier. It yields superior classification accuracy compared to the individual positional models while maintaining their space complexity, and it does so in a time-efficient way and independently of the classifier.
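As a rough illustration of F-score filtering, the sketch below ranks features by a commonly used two-class F-score (between-class separation of the feature means divided by the sum of within-class sample variances) and keeps the top k. The abstract does not state the exact F-score definition or the selection procedure used in the paper, so this formula, the function names `f_score` and `select_top_k`, and the toy data are assumptions for illustration only. The columns of `X` stand in for positional probabilities produced by models of different lengths.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """Two-class F-score of a single feature (assumed definition, not from the paper).

    pos, neg: lists of the feature's values in the positive and negative class.
    Each class needs at least two values so the sample variance is defined.
    """
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    # Convention chosen here: a feature with zero within-class variance scores 0.
    return between / within if within > 0 else 0.0


def select_top_k(X, y, k):
    """Return (indices of the k highest-scoring features, all scores).

    X: list of rows, one row of feature values per sequence.
    y: list of labels, 1 for the positive class, anything else for the negative class.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label != 1]
        scores.append(f_score(pos, neg))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [
    [0.90, 0.5],
    [0.80, 0.4],
    [0.85, 0.6],
    [0.10, 0.5],
    [0.20, 0.6],
    [0.15, 0.4],
]
y = [1, 1, 1, 0, 0, 0]

top, scores = select_top_k(X, y, k=1)
print(top, scores)
```

Because the score is computed per feature from class statistics alone, the ranking does not depend on the downstream classifier, which is the sense in which a filter method is classifier-independent.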

MeSH terms

  • Algorithms*
  • Amino Acid Sequence
  • Base Sequence
  • Computer Simulation
  • Models, Chemical*
  • Models, Genetic*
  • Models, Statistical
  • Molecular Sequence Data
  • Pattern Recognition, Automated / methods*
  • Sequence Alignment / methods*
  • Sequence Analysis / methods*