Large-scale evaluation of automated clinical note de-identification and its impact on information extraction
- PMID: 22859645
- PMCID: PMC3555323
- DOI: 10.1136/amiajnl-2012-001012
Abstract
Objective: (1) To evaluate a state-of-the-art natural language processing (NLP)-based approach to automatically de-identify a large set of diverse clinical notes. (2) To measure the impact of de-identification on the performance of information extraction algorithms on the de-identified documents.
Material and methods: A cross-sectional study included 3503 stratified, randomly selected clinical notes (over 22 note types) from five million documents produced at one of the largest US pediatric hospitals. Sensitivity, precision, and F value were computed for two automated de-identification systems that remove all 18 HIPAA-defined protected health information elements. Performance was assessed against a manually generated 'gold standard', and statistical significance was tested. The automated de-identification performance was also compared with that of two humans on a 10% subsample of the gold standard. The effect of de-identification on the performance of subsequent medication extraction was measured.
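As an illustration of the evaluation described above, the following minimal sketch computes sensitivity, precision, and F value by comparing system output against a gold standard. The span representation (note ID, character offsets, PHI type), the exact-match criterion, and the balanced F formula are assumptions made for the example; the abstract does not specify how matches were counted.

```python
from typing import Set, Tuple

# Hypothetical span representation: (note_id, start_offset, end_offset, phi_type).
Span = Tuple[str, int, int, str]


def evaluate(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (sensitivity, precision, F) using exact span matching (an assumption)."""
    true_positives = len(gold & predicted)
    # Sensitivity (recall): share of gold-standard PHI elements the system found.
    sensitivity = true_positives / len(gold) if gold else 0.0
    # Precision: share of system-flagged elements that are truly PHI.
    precision = true_positives / len(predicted) if predicted else 0.0
    total = sensitivity + precision
    # Balanced F value: harmonic mean of sensitivity and precision.
    f_value = 2 * sensitivity * precision / total if total else 0.0
    return sensitivity, precision, f_value


# Toy data, invented purely for illustration.
gold = {("n1", 0, 4, "NAME"), ("n1", 10, 20, "DATE"), ("n2", 5, 9, "PHONE")}
predicted = {("n1", 0, 4, "NAME"), ("n1", 10, 20, "DATE"), ("n2", 30, 34, "NAME")}

sensitivity, precision, f_value = evaluate(gold, predicted)
print(f"R={sensitivity:.4f} P={precision:.4f} F={f_value:.4f}")
```

In this toy case the system finds two of three gold elements and raises one false alarm, so sensitivity and precision are both 2/3. A missed element lowers sensitivity (a privacy risk), while a false alarm lowers precision (a loss of clinical content).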
Results: The gold standard included 30 815 protected health information elements and more than one million tokens. The most accurate NLP method had 91.92% sensitivity (R) and 95.08% precision (P) overall. The performance of the system was indistinguishable from that of human annotators (the annotators' performance was 92.15% (R)/93.95% (P) and 94.55% (R)/88.45% (P) overall, while the best system obtained 92.91% (R)/95.73% (P) on the same text). The impact of automated de-identification on the utility of the narrative notes for subsequent information extraction was minimal, as measured by the sensitivity and precision of medication name extraction.
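The abstract reports sensitivity and precision but not the corresponding F values. A rough way to put the four reported pairs on a single scale is to derive a balanced F value from each, as sketched below. This assumes the F value is the harmonic mean of R and P (F1); the dictionary keys are hypothetical labels, and the derived numbers are not figures reported by the authors.

```python
def f_value(sensitivity: float, precision: float) -> float:
    """Balanced F value (harmonic mean); assumes equal weighting of R and P."""
    total = sensitivity + precision
    return 2 * sensitivity * precision / total if total else 0.0


# (R, P) pairs in percent, copied from the Results paragraph.
# The labels are hypothetical names chosen for this example.
reported = {
    "best_system_overall": (91.92, 95.08),
    "annotator_a_subsample": (92.15, 93.95),
    "annotator_b_subsample": (94.55, 88.45),
    "best_system_subsample": (92.91, 95.73),
}

for label, (r, p) in reported.items():
    print(f"{label}: R={r:.2f}% P={p:.2f}% derived F={f_value(r, p):.2f}%")
```

Only the three subsample rows are directly comparable with one another, since they were measured on the same 10% subsample of the gold standard.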
Discussion and conclusion: NLP-based de-identification shows excellent performance that rivals the performance of human annotators. Furthermore, unlike manual de-identification, the automated approach scales up to millions of documents quickly and inexpensively.
