, 16 Suppl 6 (Suppl 6), S4

Computational Algorithms to Predict Gene Ontology Annotations

Pietro Pinoli et al. BMC Bioinformatics.

Abstract

Background: Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as those provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have known issues: they are incomplete, since biological knowledge is far from definitive and evolves rapidly, and they may contain erroneous annotations. Since curating novel annotations is costly in both money and time, computational tools that can reliably predict likely annotations, and thus speed up the discovery of new gene annotations, are very useful.

Methods: We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. Following the latent semantic analysis approach, we implemented two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis) and propose a novel method, Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose improving these algorithms by weighting the annotations in the input set.

Results: We tested our methods and their weighted variants on the Gene Ontology annotation sets of the genes of three model organisms (Bos taurus, Danio rerio and Drosophila melanogaster). The methods proved able to predict novel gene annotations, and the weighting procedures led to a valuable improvement, although the results vary with the size of the input annotation set and the algorithm considered.

Conclusions: Of the three methods considered, Semantic IMproved Latent Semantic Analysis provides the best results. In particular, when coupled with a proper weighting policy, it is able to predict a significant number of novel annotations, proving to be a helpful tool in supporting scientists in the curation of gene functional annotations.

Figures

Figure 1
Prediction workflow. The input is a gene annotation repository. First, the annotations of interest it contains are represented in a computable structure (i.e. a matrix, binary or weighted). Then, this representation is used as the training dataset for a machine learning method that fits a predictive model of gene annotations. Finally, the estimated model is treated as a generative process and new putative annotations are produced, each with a confidence value.
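The first step of this workflow, turning annotations into a binary matrix, might be sketched as follows. This is only a minimal illustration: the function name `build_annotation_matrix`, the gene labels and the GO identifiers are hypothetical toy values, not data or code from the paper.

```python
import numpy as np

def build_annotation_matrix(annotations):
    """Build a binary gene-by-term matrix from a list of (gene, term) pairs.

    Entry (i, j) is 1.0 if gene i is annotated to term j, else 0.0.
    """
    genes = sorted({gene for gene, _ in annotations})
    terms = sorted({term for _, term in annotations})
    gene_index = {gene: i for i, gene in enumerate(genes)}
    term_index = {term: j for j, term in enumerate(terms)}

    matrix = np.zeros((len(genes), len(terms)))
    for gene, term in annotations:
        matrix[gene_index[gene], term_index[term]] = 1.0
    return matrix, genes, terms

# Hypothetical toy annotations (illustrative identifiers only).
toy_annotations = [
    ("geneA", "GO:0000001"),
    ("geneA", "GO:0000002"),
    ("geneB", "GO:0000002"),
    ("geneC", "GO:0000003"),
]
A, genes, terms = build_annotation_matrix(toy_annotations)
```

A weighted variant would replace the 1.0 entries with weights; the specific weighting schemes are not described in this text, so they are left out of the sketch.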
Figure 2
Truncated Singular Value Decomposition. Given a truncation level k, an approximation of the W matrix is built taking into account only the first k columns of the left singular vector matrix U and of the right singular vector matrix V, and the k × k portion of the diagonal matrix S of the singular values of W. The submatrices considered are highlighted.
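A rank-k approximation of this kind can be sketched with NumPy. The snippet below is an illustration under stated assumptions, not the authors' implementation: the toy matrix `W`, the helper names, and the rule "propose an unannotated pair when its reconstructed score reaches a threshold tau" are all assumptions made for the example.

```python
import numpy as np

def truncated_svd_scores(W, k):
    """Rank-k approximation of W: first k columns of U and V, k x k block of S."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def predict_annotations(W, k, tau):
    """Return (gene_row, term_col, score) for zero entries of W scoring >= tau.

    The thresholding rule is an assumption made for this sketch.
    """
    scores = truncated_svd_scores(W, k)
    candidates = (W == 0) & (scores >= tau)
    rows, cols = np.nonzero(candidates)
    return [(int(i), int(j), float(scores[i, j])) for i, j in zip(rows, cols)]

# Hypothetical toy gene-by-term matrix (rows: genes, columns: terms).
W = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 1, 1],
], dtype=float)

W_k = truncated_svd_scores(W, k=2)
predicted = predict_annotations(W, k=2, tau=0.3)
```

Choosing k equal to the full rank of W reproduces W exactly, so no new annotations emerge; smaller values of k smooth the matrix and let zero entries acquire nonzero scores.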
Figure 3
pLSAnorm aspect model. Each gene is associated with each function term through hidden variables, the topics. Connections between nodes represent probability values.
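For orientation, the standard pLSA aspect model can be written as below. The notation is assumed for illustration (g for a gene, t for a function term, z for a hidden topic), and the normalization specific to the pLSAnorm variant is not described in this text, so it is not shown.

```latex
P(g, t) = \sum_{z} P(z)\, P(g \mid z)\, P(t \mid z)
```

Under this form, each edge in the figure would correspond to one of the conditional probabilities P(g | z) or P(t | z).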
Figure 4
ROC curves for the Bos taurus datasets. ROC curves and their AUC percentages of Annotation Confirmed rate (AC rate) versus Annotation Predicted rate (AP rate), obtained by varying the threshold τ in predicting the GO annotations of Bos taurus genes with the LSI (a), SIM (b) or pLSA (c) methods, each with or without weighting schemes.
Figure 5
Predictions for the PGRP-LB gene. Branch of the Directed Acyclic Graph (DAG) of the GO Biological Process terms associated with the PGRP-LB (Peptidoglycan recognition protein LB) gene (Entrez Gene ID: 41379) of Drosophila melanogaster. It includes GO terms present in the analyzed dataset (black circles), GO terms predicted by the SIM method with the NTM weighting scheme as associated with the same gene (blue hexagons), and those among the predicted terms that were validated in the updated version of the dataset (green rectangles). Other parts of the GO DAG are connected to the shown branch as indicated by the dotted lines.
