Sample size requirements for learning to classify with high-dimensional biomarker panels

Paul McKeigue

doi:10.1177/0962280217738807

Stat Methods Med Res. 2019 Mar;28(3):904-910. doi: 10.1177/0962280217738807. Epub 2017 Nov 28.

Author

Paul McKeigue¹

Affiliation

¹ Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK.

PMID: 29179643

Abstract

A common problem in biomedical research is to calculate the sample size required to learn a classifier using a (possibly high-dimensional) panel of biomarkers. This paper describes a simple method based on a Gaussian approximation for calculating the predictive performance of the learned classifier given the size of the biomarker panel, the size of the training sample, and the optimal predictive performance (expressed as C-statistic C_opt) of the biomarker panel that could be obtained if a training sample of unlimited size were available. Under the assumption that the biomarker effect sizes have the same correlation structure as the biomarkers, the required sample size does not depend upon these correlations, but only upon C_opt and upon the sparsity of the distribution of effect sizes, defined by the proportion of biomarkers that have nonzero effects. To learn a classifier that extracts 80% of the predictive information, the required case sample size varies from about 0.1 cases per variable for a panel with C_opt = 0.9 and a sparse distribution of effect sizes (such that 1% of biomarkers have nonzero effect sizes) to nine cases per variable for a panel with C_opt = 0.75 and a diffuse distribution of effect sizes.
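
The figures quoted in the abstract can be turned into rough planning numbers. The sketch below is illustrative only and is not the paper's method. It uses the standard equal-variance binormal relation C = Phi(d / sqrt(2)) between a C-statistic and the standardized separation d of classifier scores, and it assumes (the abstract does not say this) that "predictive information" scales with d squared. The function names and the 1000-biomarker panel in the usage note are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def separation_from_c(c_stat: float) -> float:
    """Separation d implied by a C-statistic under the equal-variance
    binormal model, where C = Phi(d / sqrt(2))."""
    if not 0.0 < c_stat < 1.0:
        raise ValueError("C-statistic must lie strictly between 0 and 1")
    return sqrt(2.0) * _STD_NORMAL.inv_cdf(c_stat)


def c_from_separation(d: float) -> float:
    """Inverse of separation_from_c: C = Phi(d / sqrt(2))."""
    return _STD_NORMAL.cdf(d / sqrt(2.0))


def c_at_information_fraction(c_opt: float, fraction: float) -> float:
    """C-statistic of a classifier that recovers `fraction` of the
    predictive information of a panel whose optimal C-statistic is c_opt.

    ASSUMPTION (not stated in the abstract): information is proportional
    to d**2, so the learned separation is sqrt(fraction) * d_opt.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    d_opt = separation_from_c(c_opt)
    return c_from_separation(sqrt(fraction) * d_opt)


def cases_required(cases_per_variable: float, n_biomarkers: int) -> int:
    """Number of cases implied by a cases-per-variable figure,
    rounded up to a whole case."""
    # round() guards against floating-point noise before taking the ceiling
    return ceil(round(cases_per_variable * n_biomarkers, 9))
```

For a hypothetical panel of 1000 biomarkers, `cases_required(0.1, 1000)` and `cases_required(9, 1000)` give the number of cases implied by the two extremes quoted in the abstract, and `c_at_information_fraction(0.75, 0.8)` gives the C-statistic that the d-squared assumption would imply for a classifier extracting 80% of the information.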

Keywords: Bayesian; Sample size; high-dimensional; linear classifier.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Biomarkers*
Biomedical Research / statistics & numerical data
Forecasting*
Normal Distribution
Outcome Assessment, Health Care / trends*
Research Design
Sample Size*

Substances

Biomarkers
