Comparing information extraction techniques for low-prevalence concepts: The case of insulin rejection by patients

Shervin Malmasi; Wendong Ge; Naoshi Hosomura; Alexander Turchin

doi:10.1016/j.jbi.2019.103306

J Biomed Inform. 2019 Nov:99:103306. doi: 10.1016/j.jbi.2019.103306. Epub 2019 Oct 13.

Authors

Shervin Malmasi¹, Wendong Ge¹, Naoshi Hosomura¹, Alexander Turchin²

Affiliations

¹ Division of Endocrinology, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA.
² Division of Endocrinology, Brigham and Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA. Electronic address: aturchin@bwh.harvard.edu.

PMID: 31618679

Abstract

Objective: To compare a range of Natural Language Processing (NLP) approaches for Information Extraction (IE) of low-prevalence concepts in clinical notes, using documentation of patients declining recommended insulin therapy as the example.

Materials and methods: We evaluated the accuracy of detecting documentation of decline of insulin therapy by patients using sentence-level naïve Bayes, logistic regression and support vector machine (SVM)-based classification (with and without SMOTE oversampling), token-level sequence labelling using conditional random fields (CRFs), uni- and bi-directional recurrent neural network (RNN) models with GRU and LSTM cells, and rule-based detection using the Canary platform. All models were trained using the same manually annotated 50,046-document training set and evaluated on the same 1,501-document held-out set. Hyperparameter optimization was performed using 10-fold cross-validation.
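The SMOTE oversampling mentioned above can be illustrated with a minimal sketch. The abstract does not state which implementation or settings were used, so everything here is an assumption for illustration: the function name `smote_sketch`, the neighbour count `k=5`, and the use of plain numeric feature vectors. In practice a maintained library such as imbalanced-learn would normally be used instead.

```python
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE-style oversampling (hypothetical helper, not the paper's code).

    Each synthetic sample is placed at a random point on the line segment
    between a minority-class sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class with three 2-dimensional feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(minority, n_synthetic=4, k=2)
```

Because the synthetic points are interpolations, they add no information beyond the minority examples already in the training set; they only rebalance the class proportions seen by the classifier.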

Results: At the sentence level, prevalence of documentation of decline of insulin therapy by patients was 0.02% in both training and held-out sets. Naïve Bayes and logistic regression models did not achieve an F₁ score ≥ 0.5 on the training set and were not further evaluated. Among the other models, evaluation against the held-out test set showed that SVM identified decline of insulin therapy by patients with an F₁ score of 0.61, CRF with an F₁ of 0.51, RNN with an F₁ of 0.67 and the Canary rule-based model with an F₁ of 0.97.
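To see why a prevalence of 0.02% makes a high F₁ score hard to reach, the following sketch computes F₁ from confusion counts and projects it for a classifier with given sensitivity and specificity. The sensitivity and specificity values in the example are hypothetical and are not taken from the study; only the 0.02% prevalence comes from the text above.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def expected_f1(prevalence, sensitivity, specificity, n=1_000_000):
    """Expected F1 over n sentences for a classifier with the given rates."""
    positives = prevalence * n
    negatives = n - positives
    tp = sensitivity * positives          # positives correctly flagged
    fn = positives - tp                   # positives missed
    fp = (1 - specificity) * negatives    # negatives wrongly flagged
    return f1_score(tp, fp, fn)

# Hypothetical classifier: 90% sensitivity, 99.9% specificity, 0.02% prevalence.
projected = expected_f1(prevalence=0.0002, sensitivity=0.9, specificity=0.999)
print(round(projected, 2))  # roughly 0.26
```

Even with only one false alarm per thousand negative sentences, false positives outnumber true positives about five to one in this hypothetical case, which pulls precision, and therefore F₁, down sharply.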

Conclusions: Identification of low-prevalence concepts can present challenges in medical language processing. Rule-based systems that include the designer's background knowledge of language may be able to achieve higher accuracy under these circumstances.
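As a rough illustration of how a rule can encode background knowledge of language, the sketch below flags sentences in which a refusal expression is followed within a few words by "insulin". The pattern and the function name `mentions_insulin_decline` are hypothetical; this does not reproduce Canary's rule format or the rules used in the study, and a real rule set would need to cover far more phrasings and exclusions.

```python
import re

# Hypothetical rule: a refusal expression, then at most six intervening
# words, then the word "insulin".
DECLINE_PATTERN = re.compile(
    r"\b(?:declin\w*|refus\w*|does not want|not willing|reluctant)\b"
    r"(?:\W+\w+){0,6}?\W+insulin\b",
    re.IGNORECASE,
)

def mentions_insulin_decline(sentence):
    """Return True if the sentence matches the illustrative decline rule."""
    return DECLINE_PATTERN.search(sentence) is not None

examples = [
    "Patient declined to start insulin at this time.",
    "Patient does not want to go on insulin.",
    "Continue insulin glargine 10 units nightly.",
]
flags = [mentions_insulin_decline(s) for s in examples]
```

A rule like this needs no positive training examples, which is one reason rules can remain usable when the target concept is extremely rare. The trade-off is that every phrasing has to be anticipated by the rule author.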

Keywords: Conditional random fields; Insulin; Natural language processing; Recurrent neural networks; Support vector machine.

MeSH terms

Data Mining / methods*
Diabetes Mellitus / drug therapy
Electronic Health Records*
Humans
Hypoglycemic Agents / therapeutic use
Insulin / therapeutic use*
Natural Language Processing*
Neural Networks, Computer
Support Vector Machine
Treatment Refusal / statistics & numerical data*
User-Computer Interface

Substances

Hypoglycemic Agents
Insulin