The lexical properties of the Gene Ontology

Proc AMIA Symp. 2002:504-8.

Abstract

The Gene Ontology (GO) is a construct developed to annotate molecular information about genes and their products. The ontology is a shared resource developed by the GO Consortium, a group of scientists who work on a variety of model organisms. In this paper we investigate the nature of the strings found in the Gene Ontology and evaluate their usefulness for natural language processing (NLP). We extend previous work that identified a set of properties that reliably identify natural language phrases in the Unified Medical Language System (UMLS). The results indicate that a large percentage (79%) of GO terms are potentially useful for NLP applications. Some 35% of the GO terms were found in a corpus derived from the MEDLINE bibliographic database, and 27% were found in the current edition of the UMLS.

MeSH terms

  • Genes*
  • Natural Language Processing
  • Subject Headings
  • Systematized Nomenclature of Medicine
  • Unified Medical Language System
  • Vocabulary, Controlled*