Linear regression models for solvent accessibility prediction in proteins

Michael Wagner; Rafał Adamczak; Aleksey Porollo; Jarosław Meller

doi:10.1089/cmb.2005.12.355

J Comput Biol. 2005 Apr;12(3):355-69. doi: 10.1089/cmb.2005.12.355.

Authors

Michael Wagner¹, Rafał Adamczak, Aleksey Porollo, Jarosław Meller

Affiliation

¹ Division of Biomedical Informatics, Cincinnati Children's Hospital Research Foundation, 3333 Burnet Avenue, Cincinnati, OH 45229, USA.

PMID: 15857247

Abstract

The relative solvent accessibility (RSA) of an amino acid residue in a protein structure is a real number that represents the solvent exposed surface area of this residue in relative terms. The problem of predicting the RSA from the primary amino acid sequence can therefore be cast as a regression problem. Nevertheless, RSA prediction has so far typically been cast as a classification problem. Consequently, various machine learning techniques have been used within the classification framework to predict whether a given amino acid exceeds some (arbitrary) RSA threshold and would thus be predicted to be "exposed," as opposed to "buried." We have recently developed novel methods for RSA prediction using nonlinear regression techniques which provide accurate estimates of the real-valued RSA and outperform classification-based approaches with respect to commonly used two-class projections. However, while their performance seems to provide a significant improvement over previously published approaches, these Neural Network (NN) based methods are computationally expensive to train and involve several thousand parameters. In this work, we develop alternative regression models for RSA prediction which are computationally much less expensive, involve orders-of-magnitude fewer parameters, and are still competitive in terms of prediction quality. In particular, we investigate several regression models for RSA prediction using linear L1-support vector regression (SVR) approaches as well as standard linear least squares (LS) regression. Using rigorously derived validation sets of protein structures and extensive cross-validation analysis, we compare the performance of the SVR with that of LS regression and NN-based methods. In particular, we show that the flexibility of the SVR (as encoded by metaparameters such as the error insensitivity and the error penalization terms) can be very beneficial to optimize the prediction accuracy for buried residues. 
We conclude that the simple and computationally much more efficient linear SVR performs comparably to nonlinear models and thus can be used in order to facilitate further attempts to design more accurate RSA prediction methods, with applications to fold recognition and de novo protein structure prediction methods.
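
To make the contrast between the two regression criteria in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows the per-residue error terms that least squares and L1 epsilon-insensitive support vector regression penalize, plus a two-class projection of a real-valued RSA. The function names, the convention that RSA is given in percent, and the numeric values in the usage lines are illustrative assumptions that do not come from the paper.

```python
def squared_loss(y_true, y_pred):
    """Per-residue squared error, the term summed by least squares regression."""
    return (y_true - y_pred) ** 2


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-residue L1 epsilon-insensitive error, as used in support vector regression.

    Errors no larger than epsilon cost nothing; larger errors grow linearly.
    """
    return max(0.0, abs(y_true - y_pred) - epsilon)


def two_class_projection(rsa, threshold):
    """Project a real-valued RSA onto 'exposed' or 'buried'.

    A residue is called exposed when its RSA exceeds the threshold.
    The threshold is an arbitrary choice (hypothetical values below).
    """
    return "exposed" if rsa > threshold else "buried"


# Hypothetical values: observed RSA of 30%, predicted 34% or 40%, epsilon of 5.
print(epsilon_insensitive_loss(30.0, 34.0, 5.0))  # inside the insensitive band
print(epsilon_insensitive_loss(30.0, 40.0, 5.0))  # outside the band, linear cost
print(squared_loss(30.0, 40.0))                   # quadratic cost for the same error
print(two_class_projection(30.0, 25.0))           # assumed 25% threshold
```

In a full SVR objective, the sum of these epsilon-insensitive errors is weighted by the error penalization term against a regularization term on the linear weights. Tuning epsilon and that weight is the flexibility the abstract refers to.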

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Amino Acids / chemistry
Computer Simulation
Data Interpretation, Statistical
Least-Squares Analysis
Linear Models
Proteins / chemistry*
Solubility
Solvents / chemistry*

Substances

Amino Acids
Proteins
Solvents
