J Biomed Inform. Oct-Dec 2002;35(5-6):322-30.
doi: 10.1016/s1532-0464(03)00032-7.

Automatically Identifying Gene/Protein Terms in MEDLINE Abstracts

Free article

Hong Yu et al. J Biomed Inform.


Motivation: Natural language processing (NLP) techniques are used to extract information automatically from computer-readable literature. In biology, identifying the terms that correspond to biological substances (e.g., genes and proteins) is a necessary step that precedes the application of other NLP systems that extract biological information (e.g., protein-protein interactions, gene regulation events, and biochemical pathways). We have developed GPmarkup (for "gene/protein-full name mark up"), a software system that automatically identifies gene/protein terms (i.e., symbols or full names) in MEDLINE abstracts. As part of the markup process, we also automatically generated a knowledge source of paired gene/protein symbols and full names (e.g., LARD for lymphocyte associated receptor of death) from MEDLINE. We found that many of the pairs in our knowledge source do not appear in the current GenBank database. Therefore, our methods may also be used for automatic lexicon generation.
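
To make the idea of pairing a symbol with its full name concrete, the following is a minimal sketch of one common heuristic: look for a parenthesized symbol and check whether its letters match the initials of the words just before it, ignoring short function words. This is an assumption for illustration only; the abstract does not describe how GPmarkup builds its pairs, and the names `STOPWORDS`, `find_full_name`, and `extract_pairs` are hypothetical.

```python
import re

# Hypothetical list of function words that do not contribute an initial.
STOPWORDS = {"of", "the", "and", "for", "in"}


def find_full_name(symbol, preceding_words):
    """Walk backward until as many content words as symbol letters are seen,
    then accept the span only if its initials spell the symbol."""
    needed = len(symbol)
    content = 0
    for start in range(len(preceding_words) - 1, -1, -1):
        if preceding_words[start].lower() not in STOPWORDS:
            content += 1
        if content == needed:
            candidate = preceding_words[start:]
            initials = "".join(
                w[0] for w in candidate if w.lower() not in STOPWORDS
            )
            if initials.upper() == symbol.upper():
                return " ".join(candidate)
            return None
    return None  # not enough words before the symbol


def extract_pairs(text):
    """Return (symbol, full name) pairs for 'full name (SYMBOL)' patterns."""
    pairs = []
    for m in re.finditer(r"\(([A-Z][A-Za-z0-9]{1,9})\)", text):
        symbol = m.group(1)
        preceding = re.findall(r"[A-Za-z0-9-]+", text[: m.start()])
        full_name = find_full_name(symbol, preceding)
        if full_name is not None:
            pairs.append((symbol, full_name))
    return pairs


sentence = "We cloned lymphocyte associated receptor of death (LARD) from T cells."
print(extract_pairs(sentence))
```

A heuristic this simple misses symbols that are not plain acronyms, so it should be read as a toy illustration of symbol and full-name pairing rather than a lexicon generator.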

Results: GPmarkup has 73% recall and 93% precision in identifying and marking up gene/protein terms in MEDLINE abstracts.
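
For readers less familiar with these metrics, the sketch below shows how precision and recall are conventionally computed from counts of true positives, false positives, and false negatives, and how the two reported figures combine into an F1 score. The counts in the first example are hypothetical, and the F1 value is derived here from the reported 93% and 73%; it is not stated in the abstract.

```python
def precision(tp, fp):
    """Fraction of marked-up terms that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of true terms that were marked up."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical toy counts, for illustration only (not from the paper).
print(precision(tp=8, fp=2))  # 8 correct out of 10 marked terms
print(recall(tp=8, fn=4))     # 8 found out of 12 true terms

# F1 derived from the reported precision (0.93) and recall (0.73).
print(round(f1_score(0.93, 0.73), 3))
```

The gap between the two figures suggests that the system misses a fair share of terms, while those it does mark are usually correct.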

Availability: A random sample of gene/protein symbols and full names and a sample set of marked up abstracts can be viewed at Contact. Voice: 212-939-7028; fax: 212-666-0140.
