Discrimination of soluble and aggregation-prone proteins based on sequence information

Yaping Fang; Jianwen Fang

doi:10.1039/c3mb70033j

Mol Biosyst. 2013 Apr 5;9(4):806-11. doi: 10.1039/c3mb70033j. Epub 2013 Feb 25.

Authors

Yaping Fang¹, Jianwen Fang

Affiliation

¹ Applied Bioinformatics Laboratory, The University of Kansas, 2034 Becker Dr., Lawrence, Kansas 66047, USA. jwfang@ku.edu

Abstract

Understanding the factors that govern protein solubility may provide insight into protein aggregation and misfolding-related diseases such as Alzheimer's disease. In this work, we attempt to identify factors important to protein solubility using feature selection. First, we calculate 1438 features, including physicochemical properties and statistics, for each protein. The Random Forest algorithm is then used to select the smallest subset of the most informative features based on their predictive performance. A predictive model is built on the 17 selected features. Compared with previous models, our model achieves better performance, with a sensitivity of 0.82, a specificity of 0.85, an accuracy (ACC) of 0.84, an area under the ROC curve (AUC) of 0.91 and a Matthews correlation coefficient (MCC) of 0.67. Furthermore, a model trained on a redundancy-reduced dataset (sequence identity <= 30%) achieves the same performance as the model without redundancy reduction. Our results provide not only a reliable model for predicting protein solubility but also a list of features important to protein solubility. The predictive model is implemented as a freely available web application.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms
Amino Acids / chemistry*
Databases, Protein
Humans
Internet
Models, Theoretical
Proteins / chemistry*
Sensitivity and Specificity
Software
Solubility

Substances

Amino Acids
Proteins
