[The risk of re-identification when analyzing electronic health records: a critical appraisal and possible solutions]

Z Evid Fortbild Qual Gesundhwes. 2019 Dec:149:22-31. doi: 10.1016/j.zefq.2020.01.002. Epub 2020 Mar 10.
[Article in German]


Background and objectives: Primary care data gathered from electronic health records in local practices could become an important building block for the future of health services research. However, the risks and reservations associated with using these data for research purposes should not be underestimated. We show the data protection and privacy problems that may arise in the secondary analysis of routine primary care data and describe the technical solutions available to address these concerns as a trust-building measure.

Methods: We screened 40 variables that are deemed important for documentation in the electronic health records of primary care physicians and rated the risk of patient re-identification when these routine medical data are used for research purposes. The rating criteria were "expert perception" (a professional observer's inferences from phenotypical characteristics documented in the 40 variables), "researchable additional knowledge" (knowledge of a person's characteristics obtainable from publicly available information and social media networks), and "statistical frequency" according to diagnosis and medication statistics.

Results: Diagnoses and reasons for contacting a general practitioner can contain particularly identifying characteristics such as "obesity" (ICD-10 E66) and "nicotine dependence" (F17). About half of all ICD codes documented in primary care fall below a critical threshold value in their absolute frequency; this is all the more problematic if the diagnoses also allow re-identification through phenotypical characteristics. Medication information in itself holds little risk of re-identifying a person. However, the way a medication is administered could be a source of re-identification, e.g., self-injection of insulin or use of inhalers. Information about times and dates is especially sensitive with regard to re-identification. A patient's sex and age generally pose no problem, except for very young or very old individuals when these age groups are rarely represented in the practice.
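The frequency criterion can be illustrated with a minimal Python sketch. It counts the number of distinct patients per ICD code and flags codes that fall below a threshold. The abstract does not state the threshold value or the data layout, so the value 5, the function name `rare_codes`, and the (patient, code) record format are assumptions made for illustration only.

```python
from collections import Counter

def rare_codes(records, threshold=5):
    """Return ICD codes documented for fewer than `threshold` distinct patients.

    `records` is an iterable of (patient_id, icd_code) pairs (hypothetical layout).
    `threshold` is an illustrative value, not the one used in the study.
    """
    # Deduplicate so that each patient counts once per code.
    distinct_pairs = set(records)
    counts = Counter(code for _, code in distinct_pairs)
    return {code: n for code, n in counts.items() if n < threshold}

# Made-up example data: E66 for 6 patients, F17 for 5, G35 for 2, Q87 for 1.
example = (
    [(f"p{i}", "E66") for i in range(6)]
    + [(f"p{i}", "F17") for i in range(5)]
    + [("p1", "G35"), ("p2", "G35"), ("p3", "Q87")]
)
flagged = rare_codes(example, threshold=5)
```

In this made-up example only G35 and Q87 are flagged. Codes flagged in this way would be candidates for removal or for grouping into broader categories before analysis.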

Discussion: Routine health data are, in principle, sensitive data. Knowledge about the variables in primary care data gathered from electronic health records in local practices and the evaluation of these data allow us to estimate more accurately the risk of re-identification for the persons concerned. In particular, chronic diagnoses and/or diagnoses in long text, calendar dates for patient contacts and therapies carry a high risk of re-identification. Technical measures such as removing data, masking values and coding should make re-identification considerably more difficult. There will always be a residual risk of re-identification, which should be discussed openly to counteract concerns about a lack of data protection or a sweeping critique of digitization in healthcare.
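The three kinds of technical measures named above (removing data, masking values, coding) might look roughly like the following Python sketch. All field names, the age bands, the quarter-level date granularity, and the keyed-hash pseudonym are assumptions chosen for illustration; the abstract does not specify how the authors implement these measures, and none of these steps eliminates the residual risk.

```python
import hashlib
import hmac
from datetime import date

# Hypothetical secret that would be held by a trusted party, not by researchers.
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical field names: direct identifiers and long-text diagnoses are removed.
DROP_FIELDS = {"name", "diagnosis_text"}

def pseudonym(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Coding: replace the patient ID with a keyed hash (HMAC-SHA256, truncated)."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def age_band(age: int, low: int = 20, high: int = 80) -> str:
    """Masking: report ten-year bands and collapse very young and very old ages."""
    if age < low:
        return f"<{low}"
    if age >= high:
        return f"{high}+"
    start = (age // 10) * 10
    return f"{start}-{start + 9}"

def coarsen_date(d: date) -> str:
    """Masking: reduce a calendar date to year and quarter."""
    quarter = (d.month - 1) // 3 + 1
    return f"{d.year}-Q{quarter}"

def deidentify(record: dict) -> dict:
    """Apply removal, masking, and coding to one record (hypothetical layout)."""
    out = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    out["patient_id"] = pseudonym(record["patient_id"])
    out["age"] = age_band(record["age"])
    out["contact_date"] = coarsen_date(record["contact_date"])
    return out

record = {
    "patient_id": "12345",
    "name": "Example Patient",
    "age": 93,
    "contact_date": date(2019, 12, 3),
    "icd": "E66",
    "diagnosis_text": "free-text diagnosis",
}
safe = deidentify(record)
```

A keyed hash is used here rather than a plain hash so that pseudonyms cannot be recomputed by anyone who merely knows the original identifiers; this is one possible design choice, not one taken from the source.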

Keywords: Allgemeinmedizin; Data anonymization; Datenanonymisierung; Datenschutz; Electronic health records; Elektronische Patientenakten; Family practice; Primärversorgung; Privacy of patient data; Quality of healthcare; Qualitätssicherung in der medizinischen Versorgung; Statistics and numerical data in primary healthcare; Statistische Datengrundlagen der ambulanten.

MeSH terms

  • Delivery of Health Care
  • Electronic Health Records*
  • General Practitioners*
  • Germany
  • Humans
  • Primary Health Care*
  • Research Design*
  • Risk