De-identification of clinical notes in French: towards a protocol for reference corpus development

J Biomed Inform. 2014 Aug:50:151-61. doi: 10.1016/j.jbi.2013.12.014. Epub 2013 Dec 29.


Background: To facilitate research applying Natural Language Processing to clinical documents, tools and resources are needed for the automatic de-identification of Electronic Health Records.

Objective: This study investigates methods for developing a high-quality reference corpus for the de-identification of clinical documents in French.

Methods: A corpus comprising a variety of clinical document types covering several medical specialties was pre-processed with two automatic de-identification systems from the MEDINA suite of tools: a rule-based system and a system using Conditional Random Fields (CRF). The pre-annotated documents were revised by two human annotators trained to mark ten categories of Protected Health Information (PHI). The human annotators worked independently and were blind to the system that produced the pre-annotations they were revising. The best pre-annotation system was applied to another random selection of 100 documents. After revision by one annotator, this set was used to train a statistical de-identification system.
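To illustrate what rule-based pre-annotation can look like, the following is a minimal sketch and not the MEDINA system itself. The two regular expressions, the category names DATE and PHONE, and the function name pre_annotate are all hypothetical; the abstract does not list the rules or the ten PHI categories used.

```python
import re
from typing import List, Tuple

# Hypothetical patterns for two PHI categories; the actual MEDINA rules
# are not described in the abstract.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
}

Span = Tuple[int, int, str]  # (start offset, end offset, category)


def pre_annotate(text: str) -> List[Span]:
    """Return candidate PHI spans for a human annotator to revise."""
    spans = []
    for category, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), category))
    # Sort by position so the spans read in document order.
    return sorted(spans)


if __name__ == "__main__":
    note = "Patient vu le 12/03/2011, tel 01 23 45 67 89."
    for start, end, category in pre_annotate(note):
        print(category, repr(note[start:end]))
```

In a workflow like the one described, such spans would serve only as suggestions: the human annotators correct, add, or delete them during revision.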

Results: Two gold standard sets of 100 documents were created based on the consensus of two human revisions of the automatic pre-annotations. The annotation experiment showed that (i) automatic pre-annotation obtained with the rule-based system performed better (F=0.813) than the CRF system (F=0.519), (ii) the human annotators spent more time revising the pre-annotations obtained with the rule-based system (from 102 to 160 minutes for 50 documents), compared to the CRF system (from 93 to 142 minutes for 50 documents), (iii) the quality of human annotation was higher when pre-annotations were obtained with the rule-based system (F-measure ranging from 0.970 to 0.987), compared to the CRF system (F-measure ranging from 0.914 to 0.981). Finally, only 20 documents from the training set were needed for the statistical system to outperform the pre-annotation systems that were trained on corpora from a medical speciality and hospital different from those in the reference corpus developed herein.
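The F-measures reported above compare a set of annotations against a reference. As a minimal sketch, assuming exact-match scoring over (start, end, category) spans (the abstract does not state the matching criterion), the balanced F-measure can be computed as follows. The function name precision_recall_f and the example spans are illustrative only.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, PHI category)


def precision_recall_f(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and balanced F-measure (F1)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Made-up spans for illustration; not data from the study.
    gold = {(0, 5, "NAME"), (10, 20, "DATE"), (30, 35, "CITY")}
    predicted = {(0, 5, "NAME"), (10, 20, "DATE"), (40, 45, "PHONE"), (50, 52, "NAME")}
    p, r, f = precision_recall_f(gold, predicted)
    print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

The same function applies whether the predicted set comes from an automatic system or from a human revision compared against the consensus gold standard.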

Conclusion: We find that better pre-annotations increase the quality of the reference corpus but require more revision time. A statistical de-identification method outperforms our rule-based system when as few as 20 custom training documents are available.

Keywords: Confidentiality; Electronic Health Records; France; Information Dissemination; Natural Language Processing.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Electronic Health Records*
  • France
  • Humans
  • Natural Language Processing