Identifying subcellular localizations of mammalian protein complexes based on graph theory with a random forest algorithm

Mol Biosyst. 2013 Apr 5;9(4):658-67. doi: 10.1039/c3mb25451h. Epub 2013 Feb 22.

Abstract

In the post-genome era, one of the most important and challenging tasks is to identify the subcellular localizations of protein complexes, and further elucidate their functions in human health with applications to understand disease mechanisms, diagnosis and therapy. Although various experimental approaches have been developed and employed to identify the subcellular localizations of protein complexes, the laboratory technologies fall far behind the rapid accumulation of protein complexes. Therefore, it is highly desirable to develop a computational method to rapidly and reliably identify the subcellular localizations of protein complexes. In this study, a novel method is proposed for predicting subcellular localizations of mammalian protein complexes based on graph theory with a random forest algorithm. Protein complexes are modeled as weighted graphs containing nodes and edges, where nodes represent proteins, edges represent protein-protein interactions and weights are descriptors of protein primary structures. Some topological structure features are proposed and adopted to characterize protein complexes based on graph theory. Random forest is employed to construct a model and predict subcellular localizations of protein complexes. Accuracies on a training set by a 10-fold cross-validation test for predicting plasma membrane/membrane attached, cytoplasm and nucleus are 84.78%, 71.30%, and 82.00%, respectively. And accuracies for the independent test set are 81.31%, 69.95% and 81.00%, respectively. These high prediction accuracies exhibit the state-of-the-art performance of the current method. It is anticipated that the proposed method may become a useful high-throughput tool and plays a complementary role to the existing experimental techniques in identifying subcellular localizations of mammalian protein complexes. The source code of Matlab and the dataset can be obtained freely on request from the authors.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Humans
  • Intracellular Space
  • Models, Biological*
  • Multiprotein Complexes / metabolism*
  • Protein Transport
  • Proteins / chemistry*
  • ROC Curve
  • Reproducibility of Results

Substances

  • Multiprotein Complexes
  • Proteins