TANGO: A GO-Term Embedding Based Method for Protein Semantic Similarity Prediction

IEEE/ACM Trans Comput Biol Bioinform. 2023 Jan-Feb;20(1):694-706. doi: 10.1109/TCBB.2022.3143480. Epub 2023 Feb 3.

Abstract

We aim to quantitatively predict protein semantic similarities (PSS), a task vital to making biological discoveries. Previously, researchers commonly exploited Gene Ontology (GO) graphs (which contain standardized, hierarchically organized GO terms for annotating distinct protein attributes) to learn GO term embeddings (vector representations) for quantifying protein attribute similarities, and then aggregated these embeddings into protein embeddings for similarity measurement. However, two key properties of GO terms and annotated proteins are not yet well explored by these learning-based methods: (1) the taxonomy relations between GO terms; and (2) the different contributions of GO terms to describing protein semantics. In this paper, we propose TANGO, a new framework composed of a TAxoNomy-aware embedding module and an aggreGatiOn module. Our Embedding Module encodes taxonomic information into GO term embeddings by incorporating the topological distances between GO terms in the GO graph hierarchy. Hence, distances between GO term embeddings can more accurately measure the meaning shared by correlated protein attributes. Our Aggregation Module automatically determines the contribution of each GO term when merging GO term embeddings into the target protein embedding, by mining concept dependency relations between GO terms in the GO graph and correlations in protein annotations. We conduct extensive experiments on several public datasets. On two PSS metrics, our new method outperforms known methods by a large margin.
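To make the aggregation idea concrete, the following is a minimal illustrative sketch, not the TANGO implementation. It assumes that each GO term already has an embedding vector, that a protein embedding is a weighted average of the embeddings of its annotated GO terms, and that two proteins are compared by cosine similarity. The GO term labels, the two-dimensional vectors, the hand-picked weights, and the function names are all hypothetical; in TANGO the contributions are determined automatically by the Aggregation Module, and the abstract does not state which similarity function is used.

```python
import math


def weighted_protein_embedding(term_vectors, weights):
    """Return the weighted average of GO term vectors.

    Assumption: weights are non-negative and not all zero. They are
    normalized here so that they sum to 1.
    """
    total = sum(weights)
    dim = len(term_vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(term_vectors, weights)) / total
        for i in range(dim)
    ]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical GO term embeddings (toy values, not learned).
go_embeddings = {
    "GO:term_a": [1.0, 0.0],
    "GO:term_b": [0.0, 1.0],
    "GO:term_c": [1.0, 1.0],
}

# Protein P is annotated with terms a and b; term a is given more weight.
protein_p = weighted_protein_embedding(
    [go_embeddings["GO:term_a"], go_embeddings["GO:term_b"]],
    weights=[3.0, 1.0],
)

# Protein Q is annotated with terms a and c, weighted equally.
protein_q = weighted_protein_embedding(
    [go_embeddings["GO:term_a"], go_embeddings["GO:term_c"]],
    weights=[1.0, 1.0],
)

similarity = cosine_similarity(protein_p, protein_q)
print(f"Similarity between P and Q: {similarity:.4f}")
```

Changing the weights shifts a protein embedding toward the GO terms that contribute most, which is the effect a learned contribution score would have in place of the fixed values used here.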

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Gene Ontology
  • Molecular Sequence Annotation
  • Proteins* / genetics
  • Semantics*

Substances

  • Proteins