Bioinformatics. 2013 Jul 1;29(13):i53-61.
doi: 10.1093/bioinformatics/btt228.

Information-theoretic Evaluation of Predicted Ontological Annotations

Free PMC article

Wyatt T Clark et al. Bioinformatics. 2013.

Abstract

Motivation: The development of effective methods for the prediction of ontological annotations is an important goal in computational biology, with protein function prediction and disease gene prioritization gaining wide recognition. Although various algorithms have been proposed for these tasks, evaluating their performance is difficult owing to problems caused both by the structure of biomedical ontologies and biased or incomplete experimental annotations of genes and gene products.

Results: We propose an information-theoretic framework to evaluate the performance of computational protein function prediction. We use a Bayesian network, structured according to the underlying ontology, to model the prior probability of a protein's function. We then define two concepts, misinformation and remaining uncertainty, that can be seen as information-theoretic analogs of precision and recall. Finally, we propose a single statistic, referred to as semantic distance, that can be used to rank classification models. We evaluate our approach by analyzing the performance of three protein function predictors of Gene Ontology terms and provide evidence that it addresses several weaknesses of currently used metrics. We believe this framework provides useful insights into the performance of protein function prediction tools.
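The abstract names these quantities without stating them formally. As a hedged sketch, the definitions usually associated with this framework can be written as below. The symbols ia, i, S_k and the exponent k are notation assumed here, not taken from the abstract; T denotes the graph of true annotations and P the predicted graph, as in Fig. 2.

```latex
\begin{align}
ia(v)    &= -\log_2 \Pr\bigl(v \mid \mathrm{Parents}(v)\bigr)
         && \text{information accretion of term } v \\
i(T)     &= \sum_{v \in T} ia(v)
         && \text{information content of an annotation graph} \\
ru(T, P) &= \sum_{v \in T \setminus P} ia(v)
         && \text{remaining uncertainty (true terms not predicted)} \\
mi(T, P) &= \sum_{v \in P \setminus T} ia(v)
         && \text{misinformation (predicted terms that are not true)} \\
S_k      &= \bigl( ru^{k} + mi^{k} \bigr)^{1/k}
         && \text{semantic distance, e.g. } k = 2
\end{align}
```

Under these assumed definitions, low remaining uncertainty plays the role of high recall and low misinformation plays the role of high precision. When a predictor outputs scores, ru and mi would be averaged over proteins at each decision threshold, and the smallest resulting S_k would be used to rank models.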

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
An example of an ontology, dataset and calculation of information content. (A) An ontology viewed as a Bayesian network together with a conditional probability table assigned to each node. Each conditional probability table is limited to a single number owing to the consistency requirement in assignments of protein function. Information accretion values calculated for each node (e.g. [formula]) are shown in gray next to each node. (B) A dataset containing four proteins whose functional annotations are generated according to the probability distribution from the Bayesian network. (C) The total information content associated with each protein found in panel (B) (e.g. [formula]). Note that [formula] and [formula], although proteins with such annotations have not been observed in panel (B).
Fig. 2.
Illustration of calculating remaining uncertainty and misinformation, given a predicted annotation graph P and a graph of true annotations T. Graph P is uniquely determined by its leaf nodes p1 and p2, and graph T by its leaf nodes t1 and t2. Nodes colored in gray represent graph T. Nodes circled in gray are used to determine remaining uncertainty (ru; right side) and misinformation (mi; left side) between T and P.
Fig. 3.
Distribution of information content (in bits) over proteins annotated by terms for each of the three ontologies. The average information content of a protein was estimated at 10.9 (std. 10.2), 32.0 (std. 33.6) and 10.4 (std. 9.2) bits for MFO, BPO and CCO, respectively.
Fig. 4.
The 2D evaluation plots. Each plot shows three prediction methods, Naive (gray, dashed), BLAST (red, solid) and GOtcha (blue, solid), constructed using cross-validation. The green point labeled GO shows the performance evaluation between two databases of experimental annotations, downloaded at the same time. The rows show the performance for different ontologies (MFO, BPO, CCO). The columns show different evaluation metrics: [formula] and [formula].
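The calculations illustrated in Figs. 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy ontology, the conditional probabilities and all function names are hypothetical, information accretion is taken to be the negative base-2 logarithm of a term's conditional probability given its parents, and semantic distance is taken to be the Euclidean (k = 2) combination of remaining uncertainty and misinformation.

```python
import math

# Hypothetical toy ontology: each term maps to its list of parent terms.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["b"],
}

# Hypothetical conditional probabilities Pr(term | all parents annotated).
COND_PROB = {"root": 1.0, "a": 0.5, "b": 0.5, "c": 0.25, "d": 0.5}


def information_accretion(term):
    """Bits gained by adding `term` once its parents are already annotated."""
    return -math.log2(COND_PROB[term])


def annotation_graph(leaves):
    """Consistent annotation graph: the leaf terms plus all their ancestors."""
    graph = set()
    stack = list(leaves)
    while stack:
        term = stack.pop()
        if term not in graph:
            graph.add(term)
            stack.extend(PARENTS[term])
    return graph


def information_content(terms):
    """Total information (bits) carried by a set of terms."""
    return sum(information_accretion(t) for t in terms)


def remaining_uncertainty(true, pred):
    """Information in true terms that the prediction missed."""
    return information_content(true - pred)


def misinformation(true, pred):
    """Information in predicted terms that are not true."""
    return information_content(pred - true)


def semantic_distance(true, pred, k=2):
    """Assumed combination of ru and mi into a single number."""
    ru = remaining_uncertainty(true, pred)
    mi = misinformation(true, pred)
    return (ru ** k + mi ** k) ** (1.0 / k)


T = annotation_graph({"c"})  # true graph: c, a, b, root
P = annotation_graph({"d"})  # predicted graph: d, b, root

print(remaining_uncertainty(T, P))  # bits in {c, a}
print(misinformation(T, P))         # bits in {d}
print(semantic_distance(T, P))
```

In this toy case the prediction shares the terms b and root with the truth, so only c and a contribute to remaining uncertainty and only d contributes to misinformation. Shallow terms with conditional probability near 1 add almost nothing to either quantity.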

Cited by 28 articles

  • UDSMProt: universal deep sequence models for protein classification.
    Strodthoff N, Wagner P, Wenzel M, Samek W. Bioinformatics. 2020 Apr 15;36(8):2401-2409. doi: 10.1093/bioinformatics/btaa003. PMID: 31913448. Free PMC article.
  • Patterns of diverse gene functions in genomic neighborhoods predict gene function and phenotype.
    Mihelčić M, Šmuc T, Supek F. Sci Rep. 2019 Dec 20;9(1):19537. doi: 10.1038/s41598-019-55984-0. PMID: 31863070. Free PMC article.
  • The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens.
    Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, Lewis KA, Georghiou G, Nguyen HN, Hamid MN, Davis L, Dogan T, Atalay V, Rifaioglu AS, Dalkıran A, Cetin Atalay R, Zhang C, Hurto RL, Freddolino PL, Zhang Y, Bhat P, Supek F, Fernández JM, Gemovic B, Perovic VR, Davidović RS, Sumonja N, Veljkovic N, Asgari E, Mofrad MRK, Profiti G, Savojardo C, Martelli PL, Casadio R, Boecker F, Schoof H, Kahanda I, Thurlby N, McHardy AC, Renaux A, Saidi R, Gough J, Freitas AA, Antczak M, Fabris F, Wass MN, Hou J, Cheng J, Wang Z, Romero AE, Paccanaro A, Yang H, Goldberg T, Zhao C, Holm L, Törönen P, Medlar AJ, Zosa E, Borukhov I, Novikov I, Wilkins A, Lichtarge O, Chi PH, Tseng WC, Linial M, Rose PW, Dessimoz C, Vidulin V, Dzeroski S, Sillitoe I, Das S, Lees JG, Jones DT, Wan C, Cozzetto D, Fa R, Torres M, Warwick Vesztrocy A, Rodriguez JM, Tress ML, Frasca M, Notaro M, Grossi G, Petrini A, Re M, Valentini G, Mesiti M, Roche DB, Reeb J, Ritchie DW, Aridhi S, Alborzi SZ, Devignes MD, Koo DCE, Bonneau R, Gligorijević V, Barot M, Fang H, Toppo S, Lavezzo E, Falda M, Berselli M, Tosatto SCE, Carraro M, Piovesan D, Ur Rehman H, Mao Q, Zhang S, Vucetic S, Black GS, Jo D, Suh E, Dayton JB, Larsen DJ, Omdahl AR, McGuffin LJ, Brackenridge DA, Babbitt PC, Yunes JM, Fontana P, Zhang F, Zhu S, You R, Zhang Z, Dai S, Yao S, Tian W, Cao R, Chandler C, Amezola M, Johnson D, Chang JM, Liao WH, Liu YW, Pascarelli S, Frank Y, Hoehndorf R, Kulmanov M, Boudellioua I, Politano G, Di Carlo S, Benso A, Hakala K, Ginter F, Mehryary F, Kaewphan S, Björne J, Moen H, Tolvanen MEE, Salakoski T, Kihara D, Jain A, Šmuc T, Altenhoff A, Ben-Hur A, Rost B, Brenner SE, Orengo CA, Jeffery CJ, Bosco G, Hogan DA, Martin MJ, O'Donovan C, Mooney SD, Greene CS, Radivojac P, Friedberg I. Genome Biol. 2019 Nov 19;20(1):244. doi: 10.1186/s13059-019-1835-8. PMID: 31744546. Free PMC article.
  • Novel comparison of evaluation metrics for gene ontology classifiers reveals drastic performance differences.
    Plyusnin I, Holm L, Törönen P. PLoS Comput Biol. 2019 Nov 4;15(11):e1007419. doi: 10.1371/journal.pcbi.1007419. eCollection 2019 Nov. PMID: 31682632. Free PMC article.
  • Maize GO Annotation-Methods, Evaluation, and Review (maize-GAMER).
    Wimalanathan K, Friedberg I, Andorf CM, Lawrence-Dill CJ. Plant Direct. 2018 Apr 11;2(4):e00052. doi: 10.1002/pld3.52. eCollection 2018 Apr. PMID: 31245718. Free PMC article.
