Evaluation of Automated Public De-Identification Tools on a Corpus of Radiology Reports

Jackson M Steinkamp; Taylor Pomeranz; Jason Adleberg; Charles E Kahn Jr; Tessa S Cook

doi:10.1148/ryai.2020190137

Radiol Artif Intell. 2020 Oct 14;2(6):e190137. doi: 10.1148/ryai.2020190137. eCollection 2020 Nov.

Authors

Jackson M Steinkamp¹, Taylor Pomeranz¹, Jason Adleberg¹, Charles E Kahn Jr¹, Tessa S Cook¹

Affiliation

¹ Department of Radiology, Hospital of the University of Pennsylvania, 3400 Spruce St, Philadelphia, PA 19104 (J.M.S., T.P., J.A., C.E.K., T.S.C.); and Boston University School of Medicine, Boston, Mass (J.M.S.).

Abstract

Purpose: To evaluate publicly available de-identification tools on a large corpus of narrative-text radiology reports.

Materials and methods: In this retrospective study, 21 categories of protected health information (PHI) were annotated in 2503 radiology reports collected between January 1, 2012 and January 8, 2019 from a large multihospital academic health system. A subset of 1023 reports served as a test set; the remaining 1480 reports were used as domain-specific training data. The types and frequencies of PHI present within the reports were tallied. Five public de-identification tools were evaluated: MITRE Identification Scrubber Toolkit, U.S. National Library of Medicine-Scrubber, Massachusetts Institute of Technology de-identification software, Emory Health Information DE-identification (HIDE) software, and Neuro named-entity recognition (NeuroNER). The tools were compared on recall, precision, and F1 score (the harmonic mean of recall and precision) for each category of PHI.
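As an illustration of the metrics named above, the following minimal Python sketch computes precision, recall, and F1 from span-level counts and microaverages them across PHI categories by pooling the counts. The `SpanCounts` class, the function names, and the example counts are hypothetical and are not taken from the study; the study's exact span-matching rules are not described here, so the sketch assumes true-positive, false-positive, and false-negative counts are already available.

```python
from dataclasses import dataclass


@dataclass
class SpanCounts:
    """Span-level counts for one PHI category (hypothetical container)."""
    tp: int  # predicted spans that match an annotated span
    fp: int  # predicted spans with no matching annotation
    fn: int  # annotated spans the tool missed


def precision(c: SpanCounts) -> float:
    """Fraction of predicted spans that are correct."""
    total = c.tp + c.fp
    return c.tp / total if total else 0.0


def recall(c: SpanCounts) -> float:
    """Fraction of annotated spans that were found."""
    total = c.tp + c.fn
    return c.tp / total if total else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def micro_average(per_category: dict[str, SpanCounts]) -> SpanCounts:
    """Pool counts over all categories before computing metrics."""
    return SpanCounts(
        tp=sum(c.tp for c in per_category.values()),
        fp=sum(c.fp for c in per_category.values()),
        fn=sum(c.fn for c in per_category.values()),
    )


# Made-up counts, for illustration only.
example = {
    "DATE": SpanCounts(tp=90, fp=5, fn=10),
    "PATIENT_NAME": SpanCounts(tp=4, fp=1, fn=6),
}
pooled = micro_average(example)
micro_f1 = f1_score(precision(pooled), recall(pooled))
```

Because microaveraging pools counts, a frequent category such as dates dominates the overall score, so a tool can post a high microaveraged F1 while performing poorly on a rare category. As a check of the formula, `f1_score(0.966, 0.882)` evaluates to about 0.922.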

Results: The annotators identified 3528 spans of PHI text within the 2503 reports. Cohen κ for interrater agreement was 0.938. Dates accounted for the majority of PHI found in the dataset of radiology reports (n = 2755 [78%]). The two best-performing tools both used machine learning methods: NeuroNER (precision, 94.5%; recall, 92.6%; microaveraged F1 score [F1], 93.6%) and Emory HIDE (precision, 96.6%; recall, 88.2%; F1, 92.2%). However, none of the tools evaluated exceeded 50% F1 on the important patient names category.
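For readers unfamiliar with the agreement statistic reported above, the following minimal Python sketch computes Cohen κ for two annotators who label the same sequence of items. The function name, the label values, and the toy data are hypothetical. The unit of comparison (for example tokens or spans) is an assumption of the sketch and is not stated in this abstract.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at
    # their own marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label]
        for label in count_a.keys() | count_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data, for illustration only: "PHI" marks a PHI item, "O" anything else.
annotator_1 = ["PHI", "O", "O", "PHI", "O", "O"]
annotator_2 = ["PHI", "O", "O", "O", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In the toy data the annotators agree on 5 of 6 items, yet κ is only 4/7 (about 0.57), because most of that agreement is expected by chance when one label dominates. This is relevant to a corpus in which PHI is rare.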

Conclusion: PHI appeared infrequently within the corpus of reports studied, which created difficulties for training machine learning systems. Out-of-the-box de-identification tools achieved limited performance on the corpus of radiology reports, suggesting the need for further advancements in public datasets and trained models.

Supplemental material is available for this article. See also the commentary by Tenenholtz and Wood in this issue. © RSNA, 2020.

© 2020 by the Radiological Society of North America, Inc.