Automatically identifying gene/protein terms in MEDLINE abstracts

Hong Yu; Vasileios Hatzivassiloglou; Andrey Rzhetsky; W John Wilbur

doi:10.1016/s1532-0464(03)00032-7

J Biomed Inform. 2002 Oct-Dec;35(5-6):322-30. doi: 10.1016/s1532-0464(03)00032-7.

Authors

Hong Yu¹, Vasileios Hatzivassiloglou, Andrey Rzhetsky, W John Wilbur

Affiliation

¹ Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY 10027, USA. Hongyu@cs.columbia.edu

PMID: 12968781
DOI: 10.1016/s1532-0464(03)00032-7

Abstract

Motivation: Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, the identification of terms corresponding to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the mark-up process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore, our methods may also be used for automatic lexicon generation.
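To make the idea of a symbol/full-name pair concrete, the following is a minimal Python sketch of one common heuristic: look for a parenthesized symbol and check whether its letters match the initials of the words just before it. This is not the GPmarkup algorithm, which the abstract does not describe; the function names, the stopword list, and the example sentence are all assumptions made for illustration.

```python
import re

# Illustrative only: NOT the GPmarkup method. All names below are hypothetical.
# Short function words that may appear inside a full name without
# contributing a letter to the symbol (e.g. "of" in the LARD example).
STOPWORDS = {"of", "the", "and", "in", "for", "to"}

# A parenthesized token of 2 to 10 characters that starts with a letter.
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")

# Words are runs of letters/digits, optionally with internal hyphens.
WORD = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")


def match_full_name(words, symbol):
    """Walk backwards over `words`, matching word initials to symbol letters.

    Returns the matched full name, or None if the initials do not line up.
    """
    letters = [c.lower() for c in symbol if c.isalpha()]
    i = len(words) - 1
    j = len(letters) - 1
    while i >= 0 and j >= 0:
        word = words[i].lower()
        if word[0] == letters[j]:
            j -= 1                      # this word supplies the next letter
        elif word not in STOPWORDS:
            return None                 # a content word that does not fit
        i -= 1
    if j >= 0:
        return None                     # ran out of words before letters
    return " ".join(words[i + 1:])


def find_symbol_pairs(text):
    """Return (symbol, full name) pairs found by the initials heuristic."""
    pairs = []
    for match in PAREN.finditer(text):
        symbol = match.group(1)
        words = WORD.findall(text[:match.start()])
        full_name = match_full_name(words, symbol)
        if full_name is not None:
            pairs.append((symbol, full_name))
    return pairs


if __name__ == "__main__":
    sentence = ("We cloned lymphocyte associated receptor of death (LARD) "
                "from T cells.")
    print(find_symbol_pairs(sentence))
```

A heuristic like this misses symbols that are not simple acronyms and cannot by itself tell gene/protein names from other abbreviations, so a real system would need additional evidence.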

Results: GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts.
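For readers unfamiliar with the two measures, the sketch below shows the standard definitions: recall is the fraction of true gene/protein terms that the system marks, and precision is the fraction of marked terms that are correct. The function name and the tiny term sets are hypothetical and are not data from the paper.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two sets of marked-up terms.

    `predicted` is the set of terms a system marked; `gold` is the set a
    human annotator marked. Both are hypothetical inputs for illustration.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example sets, not taken from the GPmarkup evaluation.
    gold = {"LARD", "p53", "BRCA1", "lymphocyte associated receptor of death"}
    predicted = {"LARD", "p53", "DNA"}
    precision, recall = precision_recall(predicted, gold)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these definitions, the reported figures mean that 93% of the terms GPmarkup marked were correct, while it found 73% of the terms it should have marked.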

Availability: A random sample of gene/protein symbols and full names and a sample set of marked-up abstracts can be viewed at http://www.cpmc.columbia.edu/homepages/yuh9001/GPmarkup/. Contact: hy52@columbia.edu. Voice: 212-939-7028; fax: 212-666-0140.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Abstracting and Indexing
Automation
Chromosome Mapping / methods
Genes*
MEDLINE*
Proteins*
Terminology as Topic*

Substances

Proteins

Grants and funding

R01 GM61372-01A2/GM/NIGMS NIH HHS/United States