A Scalable and Pragmatic Method for the Safe Sharing of High-Quality Health Data

IEEE J Biomed Health Inform. 2018 Mar;22(2):611-622. doi: 10.1109/JBHI.2017.2676880. Epub 2017 Mar 23.

Abstract

The sharing of sensitive personal health data is an important aspect of biomedical research. Methods of data de-identification are often used in this process to trade the granularity of data off against privacy risks. However, traditional approaches, such as HIPAA safe harbor or -anonymization, often fail to provide data with sufficient quality. Alternatively, data can be de-identified only to a degree which still allows us to use it as required, e.g., to carry out specific analyses. Controlled environments, which restrict the ways recipients can interact with the data, can then be used to cope with residual risks. The contributions of this article are twofold. First, we present a method for implementing controlled data sharing environments and analyze its privacy properties. Second, we present a de-identification method which is specifically suited for sanitizing health data which is to be shared in such environments. Traditional de-identification methods control the uniqueness of records in a dataset. The basic idea of our approach is to reduce the probability that a record in a dataset has characteristics which are unique within the underlying population. As the characteristics of the population are typically not known, we have implemented a pragmatic solution in which properties of the population are modeled with statistical methods. We have further developed an accompanying process for evaluating and validating the degree of protection provided. The results of an extensive experimental evaluation show that our approach enables the safe sharing of high-quality data and that it is highly scalable.

MeSH terms

  • Algorithms
  • Biomedical Research
  • Confidentiality*
  • Databases, Factual*
  • Humans
  • Information Dissemination / methods*
  • Medical Records*