Background: Contemporary bioscience sometimes demands vast sample sizes and there is often then no choice but to synthesize data across several studies and to undertake an appropriate pooled analysis. This same need is also faced in health-services and socio-economic research. When a pooled analysis is required, analytic efficiency and flexibility are often best served by combining the individual-level data from all sources and analysing them as a single large data set. But ethico-legal constraints, including the wording of consent forms and privacy legislation, often prohibit or discourage the sharing of individual-level data, particularly across national or other jurisdictional boundaries. This leads to a fundamental conflict in competing public goods: individual-level analysis is desirable from a scientific perspective, but is prevented by ethico-legal considerations that are entirely valid.
Methods: Data aggregation through anonymous summary-statistics from harmonized individual-level databases (DataSHIELD), provides a simple approach to analysing pooled data that circumvents this conflict. This is achieved via parallelized analysis and modern distributed computing and, in one key setting, takes advantage of the properties of the updating algorithm for generalized linear models (GLMs).
Results: The conceptual use of DataSHIELD is illustrated in two different settings.
Conclusions: As the study of the aetiological architecture of chronic diseases advances to encompass more complex causal pathways-e.g. to include the joint effects of genes, lifestyle and environment-sample size requirements will increase further and the analysis of pooled individual-level data will become ever more important. An aim of this conceptual article is to encourage others to address the challenges and opportunities that DataSHIELD presents, and to explore potential extensions, for example to its use when different data sources hold different data on the same individuals.