Background: De-identification and anonymization are strategies used to remove patient identifiers from electronic health record data. These strategies are of paramount importance in multicenter research studies, which must share electronic health record data across multiple environments and institutions while safeguarding patient privacy.
Methods: A systematic literature search was performed using the keywords de-identify, deidentify, de-identification, deidentification, anonymize, anonymization, data scrubbing, and text scrubbing. The search covered literature up to June 30, 2011, and involved 6 common literature databases. A total of 1798 prospective citations were identified, of which 94 met the criteria for review; the full text of these articles was obtained. The search results were supplemented by 26 additional full-text articles, for a total of 120 full-text articles reviewed.
Results: A final sample of 45 articles met the inclusion criteria for review and discussion. Articles were grouped into text, image, and biological sample categories. For text, the approaches were divided into heuristic, lexical, and pattern-based systems versus statistical learning-based systems. For images, approaches that de-identified photographic facial images and magnetic resonance image data were described. For biological samples, approaches that managed the identifiers linked to these samples were discussed, particularly with respect to meeting the anonymization requirements for Institutional Review Board exemption under the Common Rule.
Conclusions: Current de-identification strategies have limitations, and statistical learning-based systems have distinct advantages over other approaches for the de-identification of free text. True anonymization remains challenging, and further work is needed on the de-identification of datasets and the protection of genetic information.
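Illustrative sketch: To make the pattern-based category of text de-identification concrete, the following minimal Python sketch replaces a few identifier formats with category tags. It is not drawn from any of the reviewed systems. The function name `scrub`, the tag labels, and every regular expression are assumptions chosen for illustration, and the patterns cover only a handful of US-style formats.

```python
import re

# Hypothetical, illustrative patterns only. A real system would need far
# broader coverage (names, addresses, institution-specific identifiers).
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("MRN", re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE)),
]


def scrub(text):
    """Replace each substring matching a known pattern with a [LABEL] tag."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Seen on 03/14/2010, MRN: 123456, call 555-123-4567."
print(scrub(note))  # Seen on [DATE], [MRN], call [PHONE].
```

A sketch of this kind leaves untouched any identifier that matches no listed pattern, such as a patient name, which is one way to see the limitations noted in the Conclusions.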