Is Multiclass Automatic Text De-Identification Worth the Effort?

Methods Inf Med. 2018 Sep;57(4):177-184. doi: 10.3414/ME18-01-0017. Epub 2018 Sep 24.

Abstract

Objectives: Automatic de-identification to remove protected health information (PHI) from clinical text can use a "binary" model that replaces redacted text with a generic tag (e.g., "<PHI>"), or a "multiclass" model that retains more class information (e.g., "<Phone Number>"). Binary models are easier to develop, but result in text that is potentially less informative. We investigated whether building a multiclass de-identification model is worth the extra effort.
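
To make the distinction concrete, the following is a minimal sketch of the two redaction styles, not the system evaluated in the study. The `redact` function, the `Span` tuple layout, the sample note, and the class names "Doctor" and "Phone Number" are illustrative assumptions; in a real pipeline the spans would come from a trained tagger.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start offset, end offset, PHI class).
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span], multiclass: bool) -> str:
    """Replace each annotated span with a tag.

    Binary mode uses the generic "<PHI>" tag for every span;
    multiclass mode keeps the class name, e.g. "<Phone Number>".
    Spans are assumed to be non-overlapping.
    """
    pieces = []
    cursor = 0
    for start, end, phi_class in sorted(spans):
        pieces.append(text[cursor:start])  # text before the PHI span
        pieces.append(f"<{phi_class}>" if multiclass else "<PHI>")
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last span
    return "".join(pieces)


# Made-up example note; offsets were chosen by hand for this string.
note = "Call Dr. Smith at 555-1234."
spans = [(9, 14, "Doctor"), (18, 26, "Phone Number")]

binary_output = redact(note, spans, multiclass=False)
multiclass_output = redact(note, spans, multiclass=True)
```

Both outputs hide the same characters; the multiclass version only differs in how much it tells the reader about what was removed.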

Methods: Using the 2014 i2b2 dataset, we compared the accuracy and impact on document readability of two models. In the first experiment, we generated one binary and two multiclass versions trained with the same machine-learning algorithm, Conditional Random Field (CRF). Accuracy (recall, precision, F-score) and secondary metrics (e.g., training time, testing time, minimum memory required) were measured. In the second experiment, three reviewers assessed the readability of two redacted documents using the binary and multiclass methods. We computed a pooled Kappa to estimate the inter-rater agreement.
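
The accuracy measures named above can be sketched as follows. This is an illustration under stated assumptions, not the study's evaluation code: it scores at the token level with a binary PHI / non-PHI label and uses the balanced F-score (F1), whereas the abstract does not specify the matching granularity or the F-score weighting. The function name `precision_recall_f1` and the sample labels are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Token-level scores, where 1 marks a PHI token and 0 a non-PHI token."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # PHI correctly redacted
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # non-PHI wrongly redacted
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # PHI missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up labels for five tokens: 2 true positives, 1 false positive, 1 false negative.
gold_labels = [1, 1, 0, 0, 1]
pred_labels = [1, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(gold_labels, pred_labels)
```

In de-identification, recall is usually the more safety-relevant figure, since a false negative leaves PHI in the released text, while a false positive only removes harmless content.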

Results: The multiclass model did not demonstrate a clear accuracy advantage, with lower recall (-1.9%) and only slightly better precision (+0.6%), despite requiring additional computing resources. The three raters reached very high agreement (Kappa = 0.975, 95% confidence interval (0.946, 1.00), p < 0.0001) that the binary and multiclass models have the same impact on document readability.

Conclusions: This study suggests that developing a more sophisticated classification of PHI may not be worth the effort in terms of either system accuracy or the usefulness of the output.

MeSH terms

  • Algorithms
  • Data Anonymization*
  • Electronic Health Records*
  • False Positive Reactions
  • Humans