Variable-length positional modeling for biological sequence classification

Andigoni Malousi; Ioanna Chouvarda; Vassilis Koutkias; Sofia Kouidou; Nicos Maglaveras

AMIA Annu Symp Proc. 2008 Nov 6:2008:91-5.

Authors

Andigoni Malousi¹, Ioanna Chouvarda, Vassilis Koutkias, Sofia Kouidou, Nicos Maglaveras

Affiliation

¹ Lab. of Medical Informatics, Aristotle University of Thessaloniki, Greece.

PMID: 18999162
PMCID: PMC2656059

Abstract

Selecting the most informative features in supervised biological classification problems is a decisive pre-processing step for two main reasons: (1) to reduce the dimensionality of the feature space, and (2) to ascribe biological meaning to the underlying feature interactions. This paper presents a filter-based feature selection method suited to positional modeling of biological sequences. The motivation is the problem that a fixed-length positional model may describe biological sequences sub-optimally in a specific classification problem. The core filtering criterion is the F-score, and the source features are the positional probabilities describing variable-length interactions among residues. The proposed method was evaluated on human splice site classification using a linear SVM classifier. The method yields superior classification accuracy compared to the individual positional models while maintaining their space complexity, in a time-efficient way and independently of the classifier.
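To make the filtering criterion concrete, the following is a minimal sketch of F-score feature ranking. It assumes the common two-class form of the F-score (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances); the abstract does not state the exact formulation used in the paper, so this is an illustration rather than the authors' implementation. The function names `f_score` and `rank_features` and the toy feature values are hypothetical.

```python
from statistics import mean, variance


def f_score(pos_values, neg_values):
    """Two-class F-score of one feature (assumed formulation, see lead-in).

    pos_values / neg_values: lists with the feature's value in each
    positive / negative training example (at least two values per class).
    """
    overall = mean(pos_values + neg_values)
    # Between-class term: how far each class mean lies from the overall mean.
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    # Within-class term: sum of the sample variances (n - 1 denominator).
    within = variance(pos_values) + variance(neg_values)
    if within == 0:
        # Constant within both classes: perfectly separating or uninformative.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(pos_rows, neg_rows):
    """Return (feature indices sorted by decreasing F-score, list of scores)."""
    n_features = len(pos_rows[0])
    scores = [
        f_score([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical positional-probability features for three true sites
# and three decoys; column 0 separates the classes, column 1 does not.
pos_rows = [[0.9, 0.5], [0.8, 0.4], [0.7, 0.6]]
neg_rows = [[0.2, 0.5], [0.1, 0.6], [0.3, 0.4]]

order, scores = rank_features(pos_rows, neg_rows)
print(order)  # feature 0 is ranked ahead of feature 1
```

Because the score is computed per feature from class statistics alone, the ranking does not depend on the classifier trained afterwards, which is the property the abstract attributes to a filter-based method. In a pipeline of this kind, the top-ranked features drawn from models of different lengths would then be passed to the classifier.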

MeSH terms

Algorithms*
Amino Acid Sequence
Base Sequence
Computer Simulation
Models, Chemical*
Models, Genetic*
Models, Statistical
Molecular Sequence Data
Pattern Recognition, Automated / methods*
Sequence Alignment / methods*
Sequence Analysis / methods*