BMC Bioinformatics. 2009 Jan 9;10:14.
doi: 10.1186/1471-2105-10-14.

MBA: A Literature Mining System for Extracting Biomedical Abbreviations

Free PMC article

Yun Xu et al. BMC Bioinformatics.


Background: The explosive growth of the biomedical literature presents many challenges for biological researchers. One such challenge is the heavy use of abbreviations. Extracting abbreviations and their definitions accurately helps biologists and facilitates biomedical text analysis. Existing approaches fall into four broad categories: rule-based, machine-learning-based, text-alignment-based, and statistical. State-of-the-art methods either focus exclusively on acronym-type abbreviations or cannot recognize rare abbreviations. We propose a systematic method to extract abbreviations effectively. First, a scoring method classifies the abbreviations into acronym-type and non-acronym-type abbreviations; their corresponding definitions are then identified by two different methods: a text alignment algorithm for the former and a statistical method for the latter.

Results: A literature mining system, MBA, was constructed to extract both acronym-type and non-acronym-type abbreviations. An abbreviation-tagged literature corpus, the Medstract gold standard corpus, was used to evaluate the system. MBA achieved a recall of 88% at a precision of 91% on the Medstract gold standard evaluation corpus.
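
For readers who want a single summary number, the reported precision and recall can be combined into a balanced F-measure. The abstract itself does not report this value; the sketch below only applies the standard harmonic-mean formula to the two figures given above, and the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures taken from the Results paragraph: precision 91%, recall 88%.
f1 = f1_score(0.91, 0.88)
print(f"F1 = {f1:.3f}")  # roughly 0.895
```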

Conclusion: We present a new literature mining system, MBA, for extracting biomedical abbreviations. Our evaluation demonstrates that the MBA system performs better than other existing systems. It can identify the definitions not only of acronym-type abbreviations, including slightly irregular ones (e.g., <CNS1, cyclophilin seven suppressor>), but also of non-acronym-type abbreviations (e.g., <Fas, CD95>).


Figure 1. The overall architecture of the MBA system.
Figure 2. An example of the alignment algorithm. The definition is "Dialog Acts" and the abbreviation is "DAs". Together, the arrows form the best-match path.
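
The kind of alignment shown in Figure 2 can be illustrated with a minimal sketch. This is not the scoring scheme used by MBA, which the abstract does not spell out; it is a plain longest-common-subsequence alignment, ignoring case, with ties broken toward the rightmost definition character. The function name `align` and the tie-breaking rule are assumptions made for illustration.

```python
from typing import List, Tuple


def align(abbr: str, definition: str) -> List[Tuple[int, int]]:
    """Return (abbreviation index, definition index) pairs of an in-order match.

    Illustrative LCS-based alignment, not the MBA scoring function.
    """
    a, d = abbr.lower(), definition.lower()
    n, m = len(a), len(d)
    # lcs[i][j] = length of the longest common subsequence of a[:i] and d[:j]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == d[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    # Trace back from the end, so each abbreviation character is paired
    # with the rightmost definition character that keeps the match optimal.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == d[j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif lcs[i][j - 1] >= lcs[i - 1][j]:
            j -= 1  # skip a definition character
        else:
            i -= 1  # leave an abbreviation character unmatched
    pairs.reverse()
    return pairs


pairs = align("DAs", "Dialog Acts")
print(pairs)  # [(0, 0), (1, 7), (2, 10)]
```

Here "D" pairs with the initial of "Dialog", "A" with the initial of "Acts", and "s" with the final "s", which is the sort of path the arrows in the figure describe.
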
Figure 3. An example of the redundant word penalty. This is an alignment for <DER, Drosophila epidermal growth factor receptor>. In the alignment, the definition words "growth" and "factor" are both unmatched. Because they are adjacent, they are called "continuous unmatched words"; here the number of continuous unmatched words is 2.
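
Counting continuous unmatched words can be sketched as follows. The abstract does not say how the count is turned into a penalty, so only the counting step is shown. Taking the longest run of adjacent unmatched words is an assumption, and the helper name `max_continuous_unmatched` and the hand-written `matched` flags are hypothetical.

```python
from typing import Sequence


def max_continuous_unmatched(matched: Sequence[bool]) -> int:
    """Length of the longest run of adjacent unmatched definition words."""
    longest = run = 0
    for is_matched in matched:
        run = 0 if is_matched else run + 1
        longest = max(longest, run)
    return longest


words = "Drosophila epidermal growth factor receptor".split()
# Flags for <DER, ...>: D, E and R match the first, second and last word
# (set by hand here; a real system would derive them from the alignment).
matched = [True, True, False, False, True]
count = max_continuous_unmatched(matched)
print(count)  # 2, for the adjacent words "growth" and "factor"
```
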



