Modern machine learning (ML) technologies have great promise for automating diverse clinical and research workflows; however, training them requires extensive hand-labelled datasets. Disambiguating abbreviations is important for automated clinical note processing; however, broad deployment of ML for this task is restricted by the scarcity and imbalance of labeled training data. In this work we present a method that improves a model's ability to generalize through novel data augmentation techniques that utilizes information from biomedical ontologies in the form of related medical concepts, as well as global context information within the medical note. We train our model on a public dataset (MIMIC III) and test its performance on automatically generated and hand-labelled datasets from different sources (MIMIC III, CASI, i2b2). Together, these techniques boost the accuracy of abbreviation disambiguation by up to 17% on hand-labeled data, without sacrificing performance on a held-out test set from MIMIC III.
© 2021. Crown.