Motivation: Knowledge of the correct protein subcellular localization is necessary for understanding the function of a protein. Unfortunately large-scale experimental studies are limited in their accuracy. Therefore, the development of prediction methods has been limited by the amount of accurate experimental data. However, recently large-scale experimental studies have provided new data that can be used to evaluate the accuracy of subcellular predictions in human cells. Using this data we examined the performance of state of the art methods and developed SubCons, an ensemble method that combines four predictors using a Random Forest classifier.
Results: SubCons outperforms earlier methods in a dataset of proteins where two independent methods confirm the subcellular localization. Given nine subcellular localizations, SubCons achieves an F1-Score of 0.79 compared to 0.70 of the second best method. Furthermore, at a FPR of 1% the true positive rate (TPR) is over 58% for SubCons compared to less than 50% for the best individual predictor.
Availability and implementation: SubCons is freely available as a webserver (http://subcons.bioinfo.se) and source code from https://bitbucket.org/salvatore_marco/subcons-web-server. The golden dataset as well is available from http://subcons.bioinfo.se/pred/download.
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: email@example.com