Many studies in the biomedical and health sciences involve small sample sizes because of logistical or financial constraints. Identifying weak (but scientifically interesting) associations between a set of predictors and a response often requires pooling datasets from multiple diverse labs or groups. While there is a rich literature in statistical machine learning on distributional shift and inference in multi-site datasets, it is less clear when such pooling is guaranteed to help (and when it is not), independent of the inference algorithm used. In this paper, we present a hypothesis test to answer this question for both classical and high-dimensional linear regression. We precisely identify the regimes in which pooling datasets across multiple sites is sensible, and show how such policy decisions can be made via simple checks executable at each site before any data transfer happens. With a focus on Alzheimer's disease studies, we present empirical results showing that, in the regimes suggested by our analysis, pooling a local dataset with data from an international study improves power.