Quality assessment of modeled protein structure using physicochemical properties

J Bioinform Comput Biol. 2015 Apr;13(2):1550005. doi: 10.1142/S0219720015500055. Epub 2014 Dec 19.

Abstract

Physicochemical properties of proteins help determine the quality of a protein structure, and they have therefore been widely used to distinguish native or native-like structures from other predicted structures. In this work, we explore nine machine learning methods with six physicochemical properties to predict the Root Mean Square Deviation (RMSD), Template Modeling score (TM-score), and Global Distance Test score (GDT_TS-score) of a modeled protein structure in the absence of its true native state. The physicochemical properties used are total surface area, Euclidean distance (ED), total empirical energy, secondary structure penalty (SS), sequence length (SL), and pair number (PN). The data set contains a total of 95,091 modeled structures of 4,896 native targets. A real-coded Self-adaptive Differential Evolution (SaDE) algorithm is used to determine feature importance. K-fold cross-validation is used to measure the robustness of the best predictive method. Extensive experiments show that the Random Forest method outperforms the other machine learning methods. The approach makes prediction faster and less expensive. On the testing data set, the predictions of RMSD, TM-score, and GDT_TS-score achieve Root Mean Square Errors (RMSE) of 1.20, 0.06, and 0.06, respectively; correlation scores of 0.96, 0.92, and 0.91, respectively; R² values of 0.92, 0.85, and 0.84, respectively; and accuracies of 78.82%, 86.56%, and 87.37%, respectively (each with ± 0.1 err). The data set used in the study is available as a supplement at http://bit.ly/RF-PCP-DataSets.
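The four evaluation measures reported above (RMSE, correlation, R², and accuracy within an error tolerance) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names are hypothetical, the correlation is assumed to be Pearson's, R² is assumed to be the coefficient of determination, and "accuracy with ± 0.1 err" is assumed to mean the percentage of predictions whose absolute error does not exceed 0.1; the abstract does not state these definitions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted scores."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(y_true, y_pred):
    """Pearson correlation (assumed to be the 'correlation score')."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    return cov / (sd_t * sd_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination (assumed definition of R²)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def tolerance_accuracy(y_true, y_pred, tol=0.1):
    """Percentage of predictions with absolute error <= tol (assumed meaning)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return 100.0 * hits / len(y_true)


if __name__ == "__main__":
    # Made-up TM-score-like values, for illustration only.
    true_scores = [0.2, 0.5, 0.9, 0.4]
    predicted = [0.25, 0.45, 0.7, 0.4]
    print("RMSE:", rmse(true_scores, predicted))
    print("Pearson r:", pearson_r(true_scores, predicted))
    print("R^2:", r_squared(true_scores, predicted))
    print("Accuracy (%):", tolerance_accuracy(true_scores, predicted))
```

The values in the demo are invented and unrelated to the study's data; in practice the same functions would be applied to held-out predictions from each cross-validation fold.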

Keywords: Physicochemical properties of protein; SaDE; feature importance; machine learning; protein structure prediction; random forest.

Publication types

  • Evaluation Study

MeSH terms

  • Algorithms
  • Chemical Phenomena
  • Computational Biology
  • Computer Simulation
  • Databases, Protein / statistics & numerical data
  • Machine Learning
  • Models, Molecular*
  • Protein Conformation
  • Proteins / chemistry*
  • Quality Control

Substances

  • Proteins