J Med Internet Res. 2012 Feb 27;14(1):e33.
doi: 10.2196/jmir.2001.

De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset

Free PMC article

Khaled El Emam et al. J Med Internet Res. 2012.


Background: There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013.

Objective: To de-identify the claims data used in the HHP competition and ensure that it meets the requirements in the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.

Methods: We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. Three plausible re-identification attacks that can be executed on these data were identified. For each attack the re-identification probability was evaluated. If it was deemed too high then a new de-identification algorithm was applied to reduce the risk to an acceptable level. We performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack.
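The abstract does not reproduce the risk equations, so the following is only a minimal sketch of one common way to estimate the probability that a record can be re-identified. It assumes that records sharing the same quasi-identifier values form an equivalence class and that a record's risk is 1 divided by the size of its class. The function names, field names, and toy records are hypothetical and are not taken from the paper; only the 0.05 threshold comes from the abstract.

```python
from collections import Counter

# Threshold reported in the abstract for the competition dataset.
RISK_THRESHOLD = 0.05


def equivalence_class_sizes(records, quasi_identifiers):
    """Return, for each record, the number of records sharing its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [counts[k] for k in keys]


def average_risk(records, quasi_identifiers):
    """Average per-record risk, assuming risk = 1 / equivalence class size."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return sum(1.0 / s for s in sizes) / len(sizes)


def maximum_risk(records, quasi_identifiers):
    """Worst-case per-record risk: 1 / size of the smallest equivalence class."""
    return 1.0 / min(equivalence_class_sizes(records, quasi_identifiers))


# Hypothetical toy records; the field names are illustrative only.
toy_records = [
    {"birth_year": 1970, "gender": "F"},
    {"birth_year": 1970, "gender": "F"},
    {"birth_year": 1980, "gender": "M"},
    {"birth_year": 1985, "gender": "F"},
]

if __name__ == "__main__":
    avg = average_risk(toy_records, ["birth_year", "gender"])
    print(f"average risk = {avg:.2f}, acceptable = {avg <= RISK_THRESHOLD}")
```

On this toy data two records are unique, so the average risk is far above the threshold; generalizing the quasi-identifiers (for example, coarsening birth year) would enlarge the equivalence classes and lower the estimate.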

Results: An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was 0.0084, below the 0.05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions.

Conclusions: It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and methodology for, achieving open data principles for longitudinal health data.

Conflict of interest statement

None declared.


Figure 1. Equations describing how re-identification risk was measured.
Figure 2. The three domain generalization hierarchies for the three quasi-identifiers: date of birth (d), gender (g), and visit date (p).
Figure 3. A lattice showing the possible generalizations of the three quasi-identifiers: date of birth (d), gender (g), and visit date (p).
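The figures themselves are not reproduced here. As a rough illustration of what a generalization lattice over the three quasi-identifiers looks like, the sketch below enumerates every combination of generalization levels and the links between them. The number of levels per hierarchy is an assumption made for the example, not a value taken from the paper, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical hierarchy heights (number of generalization levels per quasi-identifier).
# Level 0 is the original value; higher levels are coarser. These counts are assumed.
LEVELS = {"d": 3, "g": 2, "p": 3}


def lattice_nodes(levels):
    """Every combination of generalization levels, one tuple entry per quasi-identifier."""
    return list(product(*(range(n) for n in levels.values())))


def successors(node, levels):
    """Nodes reached by generalizing exactly one quasi-identifier by one level."""
    result = []
    for i, n in enumerate(levels.values()):
        if node[i] + 1 < n:
            result.append(node[:i] + (node[i] + 1,) + node[i + 1:])
    return result


def label(node, levels):
    """Readable node label such as <d0, g1, p2>."""
    return "<" + ", ".join(f"{name}{lvl}" for name, lvl in zip(levels, node)) + ">"


if __name__ == "__main__":
    for node in lattice_nodes(LEVELS):
        targets = ", ".join(label(s, LEVELS) for s in successors(node, LEVELS))
        print(f"{label(node, LEVELS)} -> {targets or '(top of lattice)'}")
```

A de-identification search would typically walk such a lattice from the bottom node (no generalization) upward, looking for the least generalized node whose estimated risk falls below the chosen threshold.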


