Exploiting MEDLINE for gene molecular function prediction via NMF based multi-label classification

J Biomed Inform. 2018 Oct:86:160-166. doi: 10.1016/j.jbi.2018.08.009. Epub 2018 Aug 18.

Abstract

Gene Ontology (GO) provides a representation of terms and categories used to describe genes and their molecular functions, cellular components, and biological processes. GO has been the standard for describing the functions of specific genes in different model organisms. GO annotation, the tagging of genes with GO terms, has mostly been a manual and time-consuming curation process. Although many automated approaches to annotation have been proposed, few have used the knowledge available in the literature. In this manuscript, we describe the development and evaluation of a predictive system that automatically assigns molecular functions (GO terms) to genes using the biomedical literature. Because a gene can be associated with multiple molecular functions, we posed GO molecular function annotation as a multi-label classification problem with several classes. We used non-negative matrix factorization (NMF) for feature reduction and then classified the genes. To address the multi-label aspect of the data, we used the binary-relevance method, which trains one independent binary classifier per label. Although we experimented with several classifiers, the combination of binary relevance and a K-nearest neighbor (KNN) classifier performed best. In our evaluation on the UniProtKB/Swiss-Prot dataset, the best configuration achieved an F1-measure of 0.84.
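The pipeline described in the abstract (NMF feature reduction followed by binary-relevance KNN classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which NMF solver or distance metric was used, so the sketch assumes Lee-Seung multiplicative updates and squared Euclidean distance. The function names (`nmf`, `knn_binary_relevance`), the toy gene-by-term count matrix, and the two labels are all invented for illustration. As a further simplifying assumption, labeled and unlabeled genes are factorized together so that they share one latent space.

```python
import random

def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x rank) and H (rank x m).

    Assumed solver: Lee-Seung multiplicative updates for the squared
    Frobenius error. All factors stay non-negative by construction.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H

def knn_binary_relevance(train_X, train_Y, x, k=3):
    """Predict a 0/1 vector with one entry per label for feature vector x.

    Binary relevance treats every label as its own binary problem. With KNN
    the neighbour set is shared, so each label reduces to an independent
    vote: it is assigned when a strict majority of the k nearest training
    genes carry it.
    """
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    nearest = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))[:k]
    n_labels = len(train_Y[0])
    return [int(2 * sum(train_Y[i][l] for i in nearest) > k) for l in range(n_labels)]

# Toy data, invented for illustration: rows are genes, columns are counts of
# four terms in the abstracts that mention each gene.
V = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [3, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
    [0, 1, 3, 3],
    [4, 2, 0, 0],  # unlabeled gene to annotate
]
# One row per labeled gene, one column per hypothetical molecular function.
Y_train = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 1]]

W, H = nmf(V, rank=2)                       # rows of W are reduced gene features
train_features, query_features = W[:6], W[6]
prediction = knn_binary_relevance(train_features, Y_train, query_features, k=3)
print(prediction)
```

Each row of `W` is the low-dimensional representation of one gene, and `prediction` holds one independent yes/no decision per label, which is what allows a gene to receive several molecular functions at once. In practice a library implementation of NMF and KNN would likely replace these hand-written loops, and unseen genes would be projected onto a fixed `H` rather than refactorized with the training set.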

Keywords: Annotation; GO; Genes; Multi-label classification; NMF.

MeSH terms

  • Algorithms
  • Animals
  • Computational Biology / methods*
  • Databases, Genetic*
  • Databases, Protein*
  • Decision Trees
  • Gene Ontology*
  • Humans
  • MEDLINE*
  • Markov Chains
  • Models, Statistical
  • Molecular Sequence Annotation
  • Predictive Value of Tests
  • Reproducibility of Results