De-identification of Clinical Text via Bi-LSTM-CRF with Neural Language Models

AMIA Annu Symp Proc. 2020 Mar 4;2019:857-863. eCollection 2019.

Abstract

De-identification of clinical text, a prerequisite for the reuse of electronic clinical data, is a typical named entity recognition (NER) problem. A number of state-of-the-art deep learning methods for NER, such as Bi-LSTM-CRF (bidirectional long-short-term-memory conditional random fields), have been applied to de-identification. Neural language models used for language representation bring substantial improvements to many NLP tasks when they are integrated with other deep learning methods. In this paper, we introduce Bi-LSTM-CRF with neural language models for de-identification of clinical text, and evaluate it on the de-identification datasets of the i2b2 2014 and the CEGS N-GRID 2016 challenges. Four neural language models of three types, each individually integrated with Bi-LSTM-CRF, are compared in this study. Bi-LSTM-CRF with neural language models achieves the highest "strict" micro-averaged F1-score of 95.50% on the i2b2 2014 dataset and 91.82% on the CEGS N-GRID 2016 dataset, setting new benchmark results on both datasets.

Keywords: De-identification, Named entity recognition, Bidirectional long-short-term-memory, Conditional random fields, Neural language models.
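For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a "strict" micro-averaged F1-score is commonly computed for entity-level NER. It assumes that "strict" means a predicted entity counts only if its document, character offsets, and category all match a gold entity exactly, and that micro-averaging pools counts over all categories and documents. The function name, the tuple layout, and the toy data are illustrative assumptions; this is not the challenges' official evaluation script.

```python
from typing import Iterable, Set, Tuple

# Hypothetical entity layout: (doc_id, start_offset, end_offset, category)
Entity = Tuple[str, int, int, str]


def strict_micro_f1(gold: Iterable[Entity],
                    pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match ("strict") scoring.

    Counts are pooled over all categories and documents before the
    ratios are taken, which is what makes the average "micro".
    """
    gold_set: Set[Entity] = set(gold)
    pred_set: Set[Entity] = set(pred)

    # A true positive must match document, both offsets, and category.
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data (invented for illustration only).
gold = [
    ("note1", 0, 10, "NAME"),
    ("note1", 25, 35, "DATE"),
    ("note2", 5, 12, "HOSPITAL"),
]
pred = [
    ("note1", 0, 10, "NAME"),      # exact match
    ("note1", 25, 34, "DATE"),     # end offset differs by one: a miss
    ("note2", 5, 12, "HOSPITAL"),  # exact match
    ("note2", 40, 44, "AGE"),      # spurious prediction
]

precision, recall, f1 = strict_micro_f1(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In the toy data, the DATE prediction is off by a single character and therefore earns no credit, which is why strict scores tend to be lower than scores from more lenient, overlap-based matching.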

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Data Anonymization*
  • Deep Learning
  • Language
  • Natural Language Processing*
  • Neural Networks, Computer*