Impact of De-Identification on Clinical Text Classification Using Traditional and Deep Learning Classifiers

Jihad S Obeid; Paul M Heider; Erin R Weeda; Andrew J Matuskowitz; Christine M Carr; Kevin Gagnon; Tami Crawford; Stephane M Meystre

doi:10.3233/SHTI190228

Stud Health Technol Inform. 2019 Aug 21:264:283-287. doi: 10.3233/SHTI190228.

Authors

Jihad S Obeid¹, Paul M Heider¹, Erin R Weeda², Andrew J Matuskowitz³, Christine M Carr³,¹, Kevin Gagnon⁴, Tami Crawford¹, Stephane M Meystre¹

Affiliations

¹ Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA.
² Department of Clinical Pharmacy and Outcome Sciences, Medical University of South Carolina, Charleston, SC, USA.
³ Department of Emergency Medicine, Medical University of South Carolina, Charleston, SC, USA.
⁴ Department of Computer Science, University of South Carolina, Columbia, SC, USA.

Abstract

Clinical text de-identification enables collaborative research while protecting patient privacy and confidentiality; however, concerns persist that de-identification reduces the utility of the text for information extraction and machine learning tasks. In the context of a deep learning experiment to detect altered mental status in emergency department provider notes, we tested several classifiers on clinical notes in their original form and on their automatically de-identified counterparts. We tested both traditional bag-of-words-based machine learning models and word-embedding-based deep learning models. We evaluated the models on 1,113 history of present illness notes. Across all notes, the de-identification process replaced a total of 1,795 protected health information tokens. The deep learning models performed best, with accuracies of 95% on both original and de-identified notes. None of the models showed a significant difference in performance between the original and the de-identified notes.
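
The abstract does not say which statistical test supported the "no significant difference" finding. As a minimal sketch, assuming a paired design in which the same classifier is scored on the same notes in original and de-identified form, one common choice is an exact McNemar test on the notes where the two runs disagree. The function names and inputs below are hypothetical and are not taken from the study.

```python
from math import comb


def paired_disagreements(correct_original, correct_deid):
    """Count notes where exactly one of the two runs was correct.

    Each argument is a sequence of booleans, one per note, in the same
    note order (hypothetical input format, not the study's data).
    """
    pairs = list(zip(correct_original, correct_deid))
    only_original = sum(1 for o, d in pairs if o and not d)
    only_deid = sum(1 for o, d in pairs if d and not o)
    return only_original, only_deid


def mcnemar_exact_p(only_original, only_deid):
    """Two-sided exact McNemar p-value from the two disagreement counts.

    Under the null hypothesis, each disagreement is equally likely to
    favor either run, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_original + only_deid
    if n == 0:
        # The two runs never disagree, so there is no evidence of a difference.
        return 1.0
    k = min(only_original, only_deid)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A large p-value here would be consistent with the abstract's conclusion, but it only means the test failed to detect a difference, not that de-identification has no effect.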

Keywords: Data Anonymization; Machine Learning; Natural Language Processing.

MeSH terms

Confidentiality
Data Anonymization*
Deep Learning*
Electronic Health Records
Humans
Machine Learning
