, 70, 4170-4179

When Can Multi-Site Datasets Be Pooled for Regression? Hypothesis Tests, ℓ2-consistency and Neuroscience Applications

Hao Henry Zhou et al. Proc Mach Learn Res.

Abstract

Many studies in biomedical and health sciences involve small sample sizes due to logistic or financial constraints. Often, identifying weak (but scientifically interesting) associations between a set of predictors and a response necessitates pooling datasets from multiple diverse labs or groups. While there is a rich literature in statistical machine learning to address distributional shifts and inference in multi-site datasets, it is less clear when such pooling is guaranteed to help (and when it does not) - independent of the inference algorithms we use. In this paper, we present a hypothesis test to answer this question, both for classical and high dimensional linear regression. We precisely identify regimes where pooling datasets across multiple sites is sensible, and how such policy decisions can be made via simple checks executable on each site before any data transfer ever happens. With a focus on Alzheimer's disease studies, we present empirical results showing that in regimes suggested by our analysis, pooling a local dataset with data from an international study improves power.

Figures

Figure 1.
β1 and β2 are the 1st and 2nd site coefficients. After combination, β1's bias increases but its variance reduces, resulting in a smaller MSE.
Figure 2.
X and Z influence the response Y. After adjustment, X1 and X2 may be close, requiring the same β. However, Z1 and Z2 may differ a lot, and we need different γ1 and γ2.
Figure 3.
(a,d) MSE of β̂ and the acceptance rate (Sec 2.1), (b,e) MSE of β̂ and γ̂1, and the acceptance rate (Sec 2.2) using 100 bootstrap repetitions. The solid line in (d,e) marks where the condition from Theorem 2.3 equals 1. The dotted line marks where the MSEs of the single-site and multi-site models are the same. (c) λ error path when sparsity patterns are dissimilar across sites, (f) the regime where sparsity patterns are similar.
Figure 4.
(a,c) MPE for the pooled regression model after/before transformations (green/red) compared to baseline (blue), plotted against the training subset size of ADNI. The x-axis is the number/fraction of ADNI labeled samples used in training (apart from ADlocal). (b,d) show the acceptance rates for (a,c). Unlike (a), (c) restricts ADNI and ADlocal to the same training data size.
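
The bias-variance trade-off described in the Figure 1 caption can be illustrated with a small simulation. The sketch below is not the paper's hypothesis test; it is a toy Monte Carlo comparison under assumed settings (Gaussian predictors, a constant per-coordinate shift `delta` between the two sites' coefficients, and hypothetical sample sizes `n1` and `n2`). It estimates β1 by ordinary least squares on site 1 alone and on the pooled data, and reports the average squared error of each estimate.

```python
import numpy as np


def simulate_mse(delta, n1=30, n2=300, p=5, sigma=1.0, reps=500, seed=0):
    """Monte Carlo MSE of single-site vs. pooled OLS estimates of beta1.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    beta1 = np.ones(p)        # site-1 coefficients (the estimation target)
    beta2 = beta1 + delta     # site-2 coefficients, shifted by delta per coordinate
    err_single = np.empty(reps)
    err_pooled = np.empty(reps)
    for r in range(reps):
        X1 = rng.standard_normal((n1, p))
        X2 = rng.standard_normal((n2, p))
        y1 = X1 @ beta1 + sigma * rng.standard_normal(n1)
        y2 = X2 @ beta2 + sigma * rng.standard_normal(n2)
        # OLS on site 1 only: unbiased for beta1, but high variance when n1 is small.
        b_single, *_ = np.linalg.lstsq(X1, y1, rcond=None)
        # OLS on the stacked data: biased toward beta2, but lower variance.
        X_all = np.vstack([X1, X2])
        y_all = np.concatenate([y1, y2])
        b_pooled, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
        err_single[r] = np.sum((b_single - beta1) ** 2)
        err_pooled[r] = np.sum((b_pooled - beta1) ** 2)
    return err_single.mean(), err_pooled.mean()


if __name__ == "__main__":
    for delta in (0.05, 1.0):
        mse_single, mse_pooled = simulate_mse(delta)
        print(f"delta={delta}: single-site MSE={mse_single:.3f}, pooled MSE={mse_pooled:.3f}")
```

Under these assumptions, a small `delta` should favor the pooled estimate (the added bias is outweighed by the variance reduction), while a large `delta` should favor the single-site estimate.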
