Why neural networks should not be used for HIV-1 protease cleavage site prediction

Thorsteinn Rögnvaldsson; Liwen You

doi:10.1093/bioinformatics/bth144

Bioinformatics. 2004 Jul 22;20(11):1702-9. doi: 10.1093/bioinformatics/bth144. Epub 2004 Feb 26.

Authors

Thorsteinn Rögnvaldsson¹, Liwen You

Affiliation

¹ Intelligent Systems Laboratory, School of Information Science, Computer and Electrical Engineering, Halmstad University, Box 823, 301 18 Sweden. denni@ide.hh.se

PMID: 14988129
DOI: 10.1093/bioinformatics/bth144

Abstract

Several papers have been published in which nonlinear machine learning algorithms, e.g. artificial neural networks, support vector machines and decision trees, have been used to model the specificity of the HIV-1 protease and to extract specificity rules. We show that the dataset used in these studies is linearly separable and that applying nonlinear classifiers to this problem is a misuse of them. The best solution on this dataset is achieved using a linear classifier like the simple perceptron or the linear support vector machine, and it is straightforward to extract rules from these linear models. We identify key residues in peptides that are efficiently cleaved by the HIV-1 protease and list the most prominent rules, relating them to experimental results for the HIV-1 protease.

Motivation: Understanding HIV-1 protease specificity is important when designing HIV inhibitors, and several different machine learning algorithms have been applied to the problem. However, little progress has been made in understanding the specificity because nonlinear and overly complex models have been used.

Results: We show that the problem is much easier than previously reported and that linear classifiers like the simple perceptron or linear support vector machines predict at least as well as nonlinear algorithms. We also show how sets of specificity rules can be generated from the resulting linear classifiers.

Availability: The datasets used are available at http://www.hh.se/staff/bioinf/
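
The two ideas in the abstract, testing linear separability with a simple perceptron and reading rules off a linear model, can be illustrated with a minimal sketch. This is not the authors' code. It assumes fixed-length peptides of eight residues and an orthogonal (one-hot) encoding with one indicator per position and residue; neither is stated in the abstract. The peptides and labels are invented toy data, not the dataset referred to under Availability, and the helper names (`encode`, `train_perceptron`, `predict`, `top_rules`) are hypothetical.

```python
# Illustrative sketch only. The peptides and labels below are invented toy
# data; the window length and encoding are assumptions, not from the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PEPTIDE_LENGTH = 8  # assumed fixed window length


def encode(peptide):
    """One-hot encode a peptide: one 0/1 input per (position, residue), plus a bias input."""
    vec = [0.0] * (PEPTIDE_LENGTH * len(AMINO_ACIDS) + 1)
    for pos, residue in enumerate(peptide):
        vec[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(residue)] = 1.0
    vec[-1] = 1.0  # constant bias input
    return vec


def train_perceptron(samples, labels, max_epochs=1000):
    """Classic perceptron rule on +1/-1 labels. Returns (weights, converged).

    converged is True when a full epoch passes with no mistakes, which
    shows that the training set is linearly separable under this encoding.
    """
    weights = [0.0] * len(samples[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x))
            if y * activation <= 0:  # misclassified (or on the boundary)
                weights = [w + y * xi for w, xi in zip(weights, x)]
                mistakes += 1
        if mistakes == 0:
            return weights, True
    return weights, False


def predict(weights, peptide):
    """Return +1 (predicted cleaved) or -1 (predicted not cleaved)."""
    activation = sum(w * xi for w, xi in zip(weights, encode(peptide)))
    return 1 if activation > 0 else -1


def top_rules(weights, k=5):
    """Return the k (position, residue, weight) triples with the largest |weight|."""
    rules = []
    for pos in range(PEPTIDE_LENGTH):
        for j, residue in enumerate(AMINO_ACIDS):
            rules.append((pos + 1, residue, weights[pos * len(AMINO_ACIDS) + j]))
    rules.sort(key=lambda rule: abs(rule[2]), reverse=True)
    return rules[:k]


# Toy data: in this invented set, "cleaved" peptides carry F at position 4.
peptides = ["AAAFAAAA", "GGGFGGGG", "LLLFLLLL",
            "AAAKAAAA", "GGGKGGGG", "LLLDLLLL"]
labels = [1, 1, 1, -1, -1, -1]

weights, converged = train_perceptron([encode(p) for p in peptides], labels)
print("linearly separable:", converged)
for position, residue, weight in top_rules(weights, k=3):
    print(f"position {position}, residue {residue}: weight {weight:+.1f}")
```

A large positive weight suggests that the residue at that position favours cleavage in the fitted model, and a large negative weight that it disfavours cleavage. On real data, perceptron convergence demonstrates separability only for the training set it was run on, so predictive performance would still need to be estimated on held-out peptides.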

Publication types

Comparative Study
Evaluation Study
Validation Study

MeSH terms

Algorithms*
Artificial Intelligence*
Binding Sites
Computer Simulation
Databases, Protein
Enzyme Activation
HIV Protease / chemistry*
Linear Models
Models, Chemical
Neural Networks, Computer*
Nonlinear Dynamics
Pattern Recognition, Automated*
Protein Binding
Protein Interaction Mapping / methods*
Reproducibility of Results
Sensitivity and Specificity
Sequence Alignment / methods
Sequence Analysis, Protein / methods*

Substances

HIV Protease