Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts

Weisi Duan; Min Song; Alexander Yates

doi:10.1186/1471-2105-10-S3-S4

BMC Bioinformatics. 2009 Mar 19;10 Suppl 3(Suppl 3):S4. doi: 10.1186/1471-2105-10-S3-S4.

Authors

Weisi Duan¹, Min Song, Alexander Yates

Affiliation

¹ Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA. weisi.duan@temple.edu

Abstract

Background: We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort.

Methods: We build a fully automated Word Sense Disambiguation system that requires neither manually constructed external resources nor manually labeled training examples, except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the maximum-margin principle: it seeks a split of the data that maximizes the minimum distance between pairs of data points belonging to two different clusters.
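
The max-margin criterion stated in the Methods can be illustrated with a minimal sketch. It is not taken from the paper and may differ from the authors' algorithm. It relies on a standard result: stopping Kruskal's minimum-spanning-tree procedure once two components remain yields the two-way split with the largest minimum inter-cluster distance. The function name `max_margin_split`, the use of Euclidean distance, and the toy 2-D points are all assumptions made for illustration; a real system would use feature vectors derived from word contexts.

```python
import math
from itertools import combinations


def max_margin_split(points):
    """Split points into two clusters so that the minimum distance between
    points in different clusters (the margin) is as large as possible.

    Illustrative sketch: merges the closest pairs first (Kruskal-style)
    and stops when exactly two connected components remain.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to form two clusters")

    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Find the component root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, shortest first (Euclidean is an assumption).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    components = n
    for _, i, j in edges:
        if components == 2:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1

    # Label the component containing point 0 as cluster 0, the other as 1.
    root0 = find(0)
    labels = [0 if find(i) == root0 else 1 for i in range(n)]

    # The margin is the shortest distance across the two clusters.
    margin = min(
        math.dist(points[i], points[j])
        for i, j in combinations(range(n), 2)
        if labels[i] != labels[j]
    )
    return labels, margin


# Toy data: two well-separated groups of 2-D points (hypothetical values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
labels, margin = max_margin_split(points)
print(labels, round(margin, 3))
```

This sketch enumerates every pair of points, so it takes quadratic time and memory in the number of points. It is meant only to make the objective concrete and says nothing about the efficiency of the method described in the abstract.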

Results: On a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78%, outperforming a state-of-the-art unsupervised system by 2% and a baseline technique by 23%. On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6% and comes within 5% of the accuracy of a supervised system.

Conclusion: Our system is a novel, state-of-the-art technique for efficiently finding word sense clusters, and does not require training data or human effort for each new word to be disambiguated.

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Cluster Analysis
Computational Biology / methods
Humans
Information Storage and Retrieval / methods
Pattern Recognition, Automated / methods*
PubMed
Vocabulary, Controlled