Objectives: Polysemy is a frequent issue in biomedical terminologies. In the Unified Medical Language System (UMLS), polysemous terms are either represented as several independent concepts, or clustered into a single, multiply-categorized concept. The objective of this study is to analyze polysemous concepts in the UMLS through their categorization and hierarchical relations for auditing purposes.
Methods: We used the association of a concept with multiple Semantic Groups (SGs) as a surrogate for polysemy. We first extracted multi-SG (MSG) concepts from the UMLS Metathesaurus and characterized them in terms of the combinations of SGs with which they are associated. We then clustered MSG concepts in order to identify major types of polysemy. We also analyzed the inheritance of SGs between MSG concepts and their parent and child concepts. Finally, we manually reviewed the categorization of the MSG concepts for auditing purposes.
Results: The 1208 MSG concepts in the Metathesaurus are associated with 30 distinct pairs of SGs. We created 75 semantically homogeneous clusters of MSG concepts, and 276 MSG concepts could not be clustered for lack of hierarchical relations. The clusters were characterized by the most frequent pairs of semantic types of their constituent MSG concepts. MSG concepts exhibit limited semantic compatibility with their parent and child concepts. A large majority of MSG concepts (92%) are adequately categorized. Examples of miscategorized concepts are presented.
Conclusion: This work is a systematic analysis and manual review of all concepts categorized by multiple SGs in the UMLS. The correctly categorized MSG concepts do reflect polysemy in the UMLS Metathesaurus. The analysis of inheritance of SGs proved useful for auditing concept categorization in the UMLS.
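
The first step described under Methods (treating association with more than one SG as a surrogate for polysemy, then tallying the SG combinations) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the concept identifiers are hypothetical placeholders, the semantic type to SG table is only a small excerpt of the UMLS mapping, and the function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Small excerpt of the UMLS semantic type (TUI) -> Semantic Group mapping.
TUI_TO_SG = {
    "T047": "DISO",  # Disease or Syndrome
    "T033": "DISO",  # Finding
    "T121": "CHEM",  # Pharmacologic Substance
    "T116": "CHEM",  # Amino Acid, Peptide, or Protein
    "T023": "ANAT",  # Body Part, Organ, or Organ Component
    "T061": "PROC",  # Therapeutic or Preventive Procedure
    "T074": "DEVI",  # Medical Device
}

# Hypothetical concepts (placeholder CUIs) with their semantic types.
CONCEPT_TUIS = {
    "C9000001": {"T047", "T033"},  # two types, one SG (DISO)
    "C9000002": {"T121", "T116"},  # two types, one SG (CHEM)
    "C9000003": {"T023", "T061"},  # ANAT + PROC
    "C9000004": {"T074", "T121"},  # DEVI + CHEM
    "C9000005": {"T061", "T023"},  # ANAT + PROC
}


def semantic_groups(tuis):
    """Return the set of Semantic Groups covered by a set of semantic types."""
    return frozenset(TUI_TO_SG[tui] for tui in tuis)


def find_msg_concepts(concept_tuis):
    """Keep only concepts associated with more than one Semantic Group."""
    msg = {}
    for cui, tuis in concept_tuis.items():
        sgs = semantic_groups(tuis)
        if len(sgs) > 1:
            msg[cui] = sgs
    return msg


def count_sg_pairs(msg_concepts):
    """Count how many multi-SG concepts exhibit each unordered pair of SGs."""
    counts = Counter()
    for sgs in msg_concepts.values():
        for pair in combinations(sorted(sgs), 2):
            counts[pair] += 1
    return counts


msg_concepts = find_msg_concepts(CONCEPT_TUIS)
pair_counts = count_sg_pairs(msg_concepts)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

In this toy data, concepts with several semantic types from the same SG are not counted as multi-SG, which mirrors the distinction the abstract draws between multiple categorization in general and categorization across SGs.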