Sample size requirements for training high-dimensional risk predictors

Biostatistics. 2013 Sep;14(4):639-52. doi: 10.1093/biostatistics/kxt022. Epub 2013 Jul 19.

Abstract

A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires that a covariance matrix be specified, but we show that the identity covariance matrix provides an adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed.

Keywords: Conditional score; Cox regression; High-dimensional data; Risk prediction; Sample size; Training set.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Biomarkers / analysis*
  • Breast Neoplasms / genetics
  • Computer Simulation
  • Data Interpretation, Statistical*
  • Female
  • Humans
  • Oligonucleotide Array Sequence Analysis
  • Ovarian Neoplasms / genetics
  • Predictive Value of Tests*
  • Proportional Hazards Models*
  • Sample Size*

Substances

  • Biomarkers