Bi-directional semantic similarity for gene ontology to optimize biological and clinical analyses

Sang Jay Bien; Chan Hee Park; Hae Jin Shim; Woongcheol Yang; Jihun Kim; Ju Han Kim

doi:10.1136/amiajnl-2011-000659

J Am Med Inform Assoc. 2012 Sep-Oct;19(5):765-74. doi: 10.1136/amiajnl-2011-000659. Epub 2012 Feb 28.

Authors

Sang Jay Bien¹, Chan Hee Park, Hae Jin Shim, Woongcheol Yang, Jihun Kim, Ju Han Kim

Affiliation

¹ Seoul National University Biomedical Informatics, Seoul, Korea.

Abstract

Background: Semantic similarity analysis facilitates automated semantic explanations of biological and clinical data annotated by biomedical ontologies. Gene ontology (GO) has become one of the most important biomedical ontologies with a set of controlled vocabularies, providing rich semantic annotations for genes and molecular phenotypes for diseases. Current methods for measuring GO semantic similarity consider only the ancestor terms and neglect the descendants. Yet many GO term pairs have identical ancestors but very different descendants, and vice versa. Moreover, the lower parts of GO trees are full of terms with more specific semantics.

Methods: This study proposed a method of measuring semantic similarities between GO terms using the entire GO tree structure, including both the upper (ancestral) and the lower (descendant) parts. The method was compared comprehensively with well-known information content-based and graph structure-based semantic similarity measures, using protein sequence similarities, gene expression-profile correlations, protein-protein interactions, and biological pathway analyses as benchmarks.
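
The abstract does not give the formula of the proposed measure, so the following is only a minimal sketch of the general idea of scoring two terms from both their ancestors and their descendants. The toy ontology, the function names, the use of Jaccard overlap, and the equal weighting of the two directions are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy ontology (child -> parents); these are not real GO terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "P": ["A"], "Q": ["A"], "R": ["A"],
    "P1": ["P", "R"], "P2": ["P", "R"],
    "Q1": ["Q"], "Q2": ["Q"],
}


def invert(parents):
    """Build a parent -> children map from the child -> parents map."""
    children = defaultdict(list)
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    return children


CHILDREN = invert(PARENTS)


def reachable(term, edges):
    """All terms reachable from `term` by following `edges`, excluding `term`."""
    seen, stack = set(), list(edges.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets score 0.0 (an arbitrary choice)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ancestor_similarity(t1, t2):
    # Upward direction only: compare the sets of ancestor terms.
    return jaccard(reachable(t1, PARENTS), reachable(t2, PARENTS))


def descendant_similarity(t1, t2):
    # Downward direction only: compare the sets of descendant terms.
    return jaccard(reachable(t1, CHILDREN), reachable(t2, CHILDREN))


def bidirectional_similarity(t1, t2, weight=0.5):
    # Assumed combination: a weighted mean of the two directions.
    up = ancestor_similarity(t1, t2)
    down = descendant_similarity(t1, t2)
    return weight * up + (1 - weight) * down


print(ancestor_similarity("P", "Q"))       # 1.0: identical ancestors
print(descendant_similarity("P", "Q"))     # 0.0: disjoint descendants
print(bidirectional_similarity("P", "Q"))  # 0.5
print(bidirectional_similarity("P", "R"))  # 1.0: same ancestors and descendants
```

In this toy graph an ancestor-only score cannot tell the pair (P, Q) from the pair (P, R), whereas the combined score can. That is the kind of case the Background describes, with identical ancestors but very different descendants.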

Conclusion: The proposed bidirectional measure of semantic similarity outperformed other graph-based and information content-based methods.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computational Biology / methods*
Databases, Protein
Gene Expression Profiling
Humans
Natural Language Processing*
Proteins / genetics
ROC Curve
Reproducibility of Results
Semantics*
Vocabulary, Controlled*

Substances

Proteins