Plant Physiol. 2010 Sep;154(1):36-54.
doi: 10.1104/pp.110.156851. Epub 2010 Jul 20.

Combining machine learning and homology-based approaches to accurately predict subcellular localization in Arabidopsis

Rakesh Kaundal et al. Plant Physiol. 2010 Sep.
Free PMC article

Abstract

A complete map of the Arabidopsis (Arabidopsis thaliana) proteome is a major goal for the plant research community, because it would help determine the function and regulation of each encoded protein. Genome-wide prediction tools, such as those that localize gene products at the subcellular level, will substantially advance Arabidopsis gene annotation. To this end, we performed a comprehensive study in Arabidopsis and created an integrative support vector machine (SVM)-based localization predictor called AtSubP (for Arabidopsis subcellular localization predictor). It combines diverse protein features: amino acid composition, sequence-order effects, terminal information, Position-Specific Scoring Matrix (PSSM), and similarity search-based Position-Specific Iterated-Basic Local Alignment Search Tool (PSI-BLAST) information. When used to predict seven subcellular compartments through a 5-fold cross-validation test, our hybrid-based best classifier achieved an overall sensitivity of 91% with high-confidence precision and Matthews correlation coefficient values of 90.9% and 0.89, respectively. Benchmarking AtSubP on two independent data sets, one from Swiss-Prot and another containing green fluorescent protein- and mass spectrometry-determined proteins, showed a significant improvement in the prediction accuracy of species-specific AtSubP over some widely used "general" tools such as TargetP, LOCtree, PA-SUB, MultiLoc, WoLF PSORT, Plant-PLoc, and our newly created All-Plant method. Cross-comparison of AtSubP on six nontrained eukaryotic organisms (rice [Oryza sativa], soybean [Glycine max], human [Homo sapiens], yeast [Saccharomyces cerevisiae], fruit fly [Drosophila melanogaster], and worm [Caenorhabditis elegans]) revealed inferior predictions for these organisms. AtSubP significantly outperformed all the prediction tools currently used for Arabidopsis proteome annotation and may therefore serve as a better complement for the plant research community.
A supplemental Web site that hosts all the training/testing data sets and whole proteome predictions is available at http://bioinfo3.noble.org/AtSubP/.
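
The abstract reports sensitivity, precision, and Matthews correlation coefficient (MCC). As a reminder of how these quantities are usually computed for a single compartment treated one-vs-rest, here is a minimal sketch. The function name and the confusion counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's own averaging over the seven compartments may differ.

```python
import math


def one_vs_rest_metrics(tp, fp, tn, fn):
    """Return (sensitivity, precision, mcc) from confusion counts for one class."""
    # Sensitivity (recall): fraction of true members of the class that were found.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: fraction of predicted members that really belong to the class.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, precision, mcc


# Hypothetical counts for one compartment (not data from the study).
sens, prec, mcc = one_vs_rest_metrics(tp=90, fp=10, tn=890, fn=10)
print(f"sensitivity={sens:.3f} precision={prec:.3f} MCC={mcc:.3f}")
```

Unlike sensitivity alone, MCC uses all four cells of the confusion matrix, so it stays informative when compartments differ greatly in size.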

Figures

Figure 1.
Performance comparison of overall sensitivities achieved by PSI-BLAST and various SVM modules constructed on the basis of different features of a protein sequence. For detailed performance of each classifier, see individual tables in Supplemental Data.
Figure 2.
Average amino acid composition of the first 30 residues at the N-terminal region (potentially the cTP-containing region) of chloroplast-localized proteins in Arabidopsis compared with other plant cTPs. The pie charts at the top show the same data except that the amino acid types have been grouped by the electrostatic properties of their side chains. [See online article for color version of this figure.]
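
The quantity plotted in Figure 2 can be sketched as follows: take the first 30 residues of each sequence, compute the percentage of each of the 20 standard amino acids, and average over sequences. This is an illustrative sketch only. The function names and the toy sequences are hypothetical, and averaging per-sequence percentages is an assumption about how the figure was produced.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def n_terminal_composition(sequence, window=30):
    """Percent composition of the standard amino acids in the first `window` residues."""
    # Slice first, then drop non-standard symbols (e.g. X), so the
    # denominator counts only recognised residues inside the window.
    region = [aa for aa in sequence.upper()[:window] if aa in AMINO_ACIDS]
    if not region:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    counts = Counter(region)
    return {aa: 100.0 * counts[aa] / len(region) for aa in AMINO_ACIDS}


def average_composition(sequences, window=30):
    """Mean of the per-sequence N-terminal compositions."""
    profiles = [n_terminal_composition(s, window) for s in sequences]
    if not profiles:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: sum(p[aa] for p in profiles) / len(profiles) for aa in AMINO_ACIDS}


# Toy sequences for illustration only (not real transit peptides).
toy = ["MASSLLTSSVAAKPSRLSSSFRGG", "MAAATLSSSFPLKTSRRAAVSSNG"]
profile = average_composition(toy)
print(sorted(profile.items(), key=lambda kv: -kv[1])[:3])
```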
Figure 3.
Expected prediction accuracy with a reliability index (RI) equal to a given value for the best classifier (based on the performance on independent test set I). The fractions of sequences that are predicted with RI ≥ 1, 2, 3, 4, or 5 are also given. An RI curve based on a 5-fold cross-validation test is provided in Supplemental Figure S6. [See online article for color version of this figure.]
Figure 4.
Receiver operating characteristic (ROC) curves for the best classifier (based on the performance on independent test set I). A plot of the ROC curve for each localization is shown. The ontological labels are as follows: Chloro(plast), Cyto(plasm), Golgi (apparatus), Mito(chondria), Extracell(ular), Nucl(eus), and Cel(l) memb(rane). ROC curves based on a 5-fold cross-validation test are provided in Supplemental Figure S7. [See online article for color version of this figure.]
Figure 5.
Overall architecture of the methodology used to develop one similarity-based PSI-BLAST classifier and 14 diverse SVM-based classifiers from various protein features. [See online article for color version of this figure.]
Figure 6.
Schematic representation of the algorithm used to convert an L × 20 PSSM into a 400-dimensional input vector. For a protein chain of L amino acid residues, the PSSM has L rows and 20 columns, where the 20 columns represent the occurrence/substitution scores for each of the 20 amino acid types. [See online article for color version of this figure.]
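
The caption for Figure 6 does not spell out the individual steps, so the following is a sketch of one common way to reduce an L × 20 PSSM to a fixed 400-dimensional vector: for each of the 20 amino acid types, sum the PSSM rows at the positions where that residue occurs, then concatenate the 20 resulting rows. The function name, the alphabetical grouping order, and the absence of any score scaling are assumptions made for illustration; the paper's exact procedure may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # grouping order is an arbitrary choice here


def pssm_to_400(sequence, pssm):
    """Collapse an L x 20 PSSM into a 400-element list.

    sequence: protein sequence of length L (one-letter codes).
    pssm: list of L rows, each a list of 20 scores.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    grid = [[0.0] * 20 for _ in range(20)]  # 20 residue types x 20 PSSM columns
    for residue, row in zip(sequence.upper(), pssm):
        if len(row) != 20:
            raise ValueError("each PSSM row must have 20 columns")
        i = index.get(residue)
        if i is None:
            continue  # skip non-standard residues such as X
        for j, score in enumerate(row):
            grid[i][j] += score  # accumulate scores for this residue type
    # Flatten row by row: 20 x 20 -> 400 values, independent of L.
    return [value for row in grid for value in row]


# Toy 3-residue example with made-up scores (not real PSI-BLAST output).
vector = pssm_to_400("ACA", [[1] * 20, [2] * 20, [3] * 20])
print(len(vector))
```

The point of the transformation is that proteins of any length L map to vectors of the same size, which is what a fixed-input classifier such as an SVM requires.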
