Purpose: Privacy-protecting analytic and data-sharing methods that minimize the disclosure risk of sensitive information are increasingly important due to the growing interest in utilizing data across multiple sources. We conducted a simulation study to examine how avoiding sharing individual-level data in a distributed data network can affect analytic results.
Methods: The base scenario had four sites of varying sizes with 5% outcome incidence, 50% treatment prevalence, and seven confounders. We varied treatment prevalence, outcome incidence, treatment effect, site size, number of sites, and covariate distribution. Confounding adjustment was conducted using propensity scores or disease risk scores. We compared analyses of three types of aggregate-level data requested from sites (risk-set data, summary-table data, and effect-estimate data, the last combined by meta-analysis) against benchmark results from analysis of pooled individual-level data. We assessed bias and precision of hazard ratio estimates as well as the accuracy of standard error (SE) estimates.
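To illustrate the effect-estimate (meta-analysis) approach, the following is a minimal sketch of how site-level estimates could be combined without sharing individual-level data. It assumes a fixed-effect, inverse-variance pooling of site-specific log hazard ratios; the abstract does not state which pooling model was used, so this choice, the function name `pool_fixed_effect`, and the example numbers are all hypothetical.

```python
import math


def pool_fixed_effect(log_hrs, ses):
    """Inverse-variance (fixed-effect) pooling of site-level log hazard ratios.

    log_hrs: site-specific log hazard ratio estimates
    ses:     their standard errors (same order, all > 0)
    Returns (pooled_log_hr, pooled_se).
    Assumption: fixed-effect pooling, chosen here only for illustration.
    """
    if len(log_hrs) != len(ses) or not log_hrs:
        raise ValueError("need one standard error per estimate")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")

    # Each site is weighted by the inverse of its estimated variance.
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_hrs)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical estimates from four sites (not taken from the study).
site_log_hrs = [0.10, 0.25, 0.18, 0.05]
site_ses = [0.20, 0.15, 0.30, 0.25]

log_hr, se = pool_fixed_effect(site_log_hrs, site_ses)
lower = math.exp(log_hr - 1.96 * se)
upper = math.exp(log_hr + 1.96 * se)
print(f"Pooled HR {math.exp(log_hr):.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Only two numbers per site (an estimate and its standard error) cross site boundaries in this sketch, which is what makes the approach attractive for privacy protection.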
Results: All the aggregate-level data-sharing approaches, regardless of confounding adjustment method, successfully approximated pooled individual-level data analysis in most simulation scenarios. Meta-analysis showed minor bias when using inverse probability of treatment weights (IPTW) in infrequent exposure (5%), rare outcome (0.01%), and small site (5,000 patients) settings. SE estimates became less accurate for the IPTW risk-set approach with less frequent exposure and for the propensity score-matching meta-analysis approach with rare outcomes.
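For readers less familiar with IPTW, the sketch below shows how the weights are conventionally computed from a propensity score. It assumes unstabilized weights targeting the average treatment effect; the abstract does not say whether stabilized or truncated weights were used, and the function name `iptw_weights` and the example values are hypothetical.

```python
def iptw_weights(treated, propensity):
    """Unstabilized inverse probability of treatment weights.

    treated:    0/1 treatment indicators
    propensity: estimated probability of treatment for each patient
    Treated patients get 1/ps; untreated patients get 1/(1 - ps).
    Assumption: unstabilized weights, used here only for illustration.
    """
    weights = []
    for t, ps in zip(treated, propensity):
        if not 0.0 < ps < 1.0:
            raise ValueError("propensity scores must lie strictly in (0, 1)")
        weights.append(1.0 / ps if t else 1.0 / (1.0 - ps))
    return weights


# Hypothetical patients: the first two have a low probability of treatment,
# as would be common when exposure prevalence is around 5%.
treated = [1, 0, 1, 0]
propensity = [0.05, 0.05, 0.50, 0.50]

for t, ps, w in zip(treated, propensity, iptw_weights(treated, propensity)):
    print(f"treated={t} ps={ps:.2f} weight={w:.2f}")
```

With a propensity score of 0.05, a treated patient carries a weight of about 20, so a handful of patients can dominate a site-level estimate. This may contribute to the sensitivity of IPTW-based analyses when exposure is infrequent, although the abstract itself does not attribute the observed bias to any specific mechanism.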
Conclusions: Overall, we found that we can avoid sharing individual-level data and obtain valid results in many settings, although care must be taken with the meta-analysis approach in infrequent exposure and rare outcome scenarios, particularly when confounding adjustment is performed with IPTW.
Keywords: distributed databases; pharmacoepidemiology; privacy-protecting methods; propensity score.
© 2018 John Wiley & Sons, Ltd.