J Biomed Semantics. 2014 Feb 5;5(1):6.
doi: 10.1186/2041-1480-5-6.

Synonym Extraction and Abbreviation Expansion With Ensembles of Semantic Spaces

Free PMC article

Aron Henriksson et al. J Biomed Semantics.

Abstract

Background: Terminologies that account for variation in language use by linking synonyms and abbreviations to their corresponding concept are important enablers of high-quality information extraction from medical texts. Due to the use of specialized sub-languages in the medical domain, manual construction of semantic resources that accurately reflect language use is both costly and challenging, often resulting in low coverage. Although models of distributional semantics applied to large corpora provide a potential means of supporting development of such resources, their ability to isolate synonymy from other semantic relations is limited. Their application in the clinical domain has also only recently begun to be explored. Combining distributional models and applying them to different types of corpora may lead to enhanced performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs.

Results: A combination of two distributional models - Random Indexing and Random Permutation - employed in conjunction with a single corpus outperforms using either of the models in isolation. Furthermore, combining semantic spaces induced from different types of corpora - a corpus of clinical text and a corpus of medical journal articles - further improves results, outperforming a combination of semantic spaces induced from a single source, as well as a single semantic space induced from the conjoint corpus. A combination strategy that simply sums the cosine similarity scores of candidate terms is generally the most effective of those explored. Finally, applying simple post-processing filtering rules yields substantial performance gains on the tasks of extracting abbreviation-expansion pairs, but not synonyms. The best results, measured as recall in a list of ten candidate terms, for the three tasks are: 0.39 for abbreviations to long forms, 0.33 for long forms to abbreviations, and 0.47 for synonyms.
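
As an illustration only, the summing strategy and the recall measure mentioned above could be sketched as follows. This is a minimal sketch and not the authors' implementation: the function names (`cosine`, `rank_candidates`, `recall_at_k`), the toy vectors, and the choice to let a term that is missing from a space contribute nothing to its sum are all assumptions made for the example. Recall is taken here to be the fraction of reference terms that appear in the returned candidate list.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, spaces, k=10):
    """Sum each candidate's cosine similarity to the query over all spaces.

    `spaces` is a list of dicts mapping term -> vector (hypothetical format).
    Spaces may have different vocabularies and dimensionalities; a space that
    lacks the query is skipped (an assumption of this sketch).
    """
    scores = defaultdict(float)
    for space in spaces:
        if query not in space:
            continue
        query_vec = space[query]
        for term, vec in space.items():
            if term != query:
                scores[term] += cosine(query_vec, vec)
    # Highest summed score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def recall_at_k(ranked, reference):
    """Fraction of reference terms found in the ranked candidate list."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(set(ranked) & reference) / len(reference)


# Toy vectors, invented for illustration; they do not come from the study.
clinical = {
    "mi": [1.0, 0.0],
    "infarction": [0.9, 0.1],
    "aspirin": [0.0, 1.0],
}
medical = {
    "mi": [0.0, 1.0, 0.0],
    "infarction": [0.1, 0.9, 0.0],
    "aspirin": [1.0, 0.0, 0.0],
}

top = rank_candidates("mi", [clinical, medical], k=1)
print(top, recall_at_k(top, {"infarction"}))
```

In this toy setup "infarction" is close to "mi" in both spaces, so its summed score places it first. The post-processing filtering rules are not specified in the abstract and are therefore left out of the sketch.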

Conclusions: This study demonstrates that ensembles of semantic spaces can yield improved performance on the tasks of automatically extracting synonyms and abbreviation-expansion pairs. This notion, which merits further exploration, allows different distributional models - with different model parameters - and different types of corpora to be combined, potentially allowing enhanced performance to be obtained on a wide range of natural language processing tasks.

Figures

Figure 1
Ensembles of semantic spaces for synonym extraction and abbreviation expansion. Semantic spaces built with different model parameters are induced from different corpora. The outputs of the semantic spaces are combined in order to obtain better results than a single semantic space used in isolation.
Figure 2
Distribution of candidate terms for the clinical corpus. The distribution (cosine similarity and rank) of candidates for synonyms for the best combination of semantic spaces induced from the clinical corpus. The results show the distribution for query terms in the development reference standard.
Figure 3
Distribution of candidate terms for the medical corpus. The distribution (cosine similarity and rank) of candidates for synonyms for the best combination of semantic spaces induced from the medical corpus. The results show the distribution for query terms in the development reference standard.
Figure 4
Distribution of candidate terms for clinical + medical corpora. The distribution (combined cosine similarity and rank) of candidates for synonyms for the ensemble of semantic spaces induced from medical and clinical corpora. The results show the distribution for query terms in the development reference standard.
Figure 5
Frequency thresholds. The relation between recall and the required minimum frequency of occurrence for the reference standard terms in both corpora. The number of query terms for each threshold value is also shown.
