A method for constructing a confidence bound for the actual error rate of a prediction rule in high dimensions

Biostatistics. 2009 Apr;10(2):282-96. doi: 10.1093/biostatistics/kxn035. Epub 2008 Nov 27.

Abstract

Constructing a confidence interval for the actual, conditional error rate of a prediction rule from multivariate data is problematic because this error rate is not a population parameter in the traditional sense: it is a functional of the training set. When the training set changes, so does this "parameter." A valid method for constructing confidence intervals for the actual error rate was previously developed by McLachlan. However, McLachlan's method cannot be applied in many cancer research settings because it requires the number of samples to be much larger than the number of dimensions (n >> p) and assumes that no dimension-reducing feature selection step is performed. Here, an alternative to McLachlan's method is presented that can be applied when p >> n, with an additional adjustment in the presence of feature selection. Coverage probabilities of the new method are shown to be nominal or conservative over a wide range of scenarios. The new method is relatively simple to implement and not computationally burdensome.
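
The point that the actual error rate is a functional of the training set can be illustrated with a small simulation. The sketch below is not the method of the paper and not McLachlan's method. It assumes, purely for illustration, two Gaussian classes with identity covariance and equal priors, a nearest-centroid rule, and p >> n. All function names and parameter values are hypothetical. Under these assumptions the conditional error of a fitted rule has a closed form, so it can be computed exactly for each simulated training set.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def centroid(samples):
    """Coordinate-wise mean of a list of feature vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def conditional_error(train0, train1, mu0, mu1):
    """Exact error rate of the nearest-centroid rule built from this
    particular training set (assumed model: N(mu, I) classes, equal priors)."""
    m0, m1 = centroid(train0), centroid(train1)
    w = [b - a for a, b in zip(m0, m1)]                # discriminant direction
    c = 0.5 * dot(w, [a + b for a, b in zip(m0, m1)])  # decision threshold
    norm = math.sqrt(dot(w, w))
    # The rule predicts class 1 when w.x > c; w.x is normal with sd = norm.
    err0 = 1.0 - normal_cdf((c - dot(w, mu0)) / norm)  # class 0 called class 1
    err1 = normal_cdf((c - dot(w, mu1)) / norm)        # class 1 called class 0
    return 0.5 * (err0 + err1)


def simulate_conditional_errors(n_sets=10, n_per_class=5, p=200, seed=0):
    """Draw several training sets from one fixed population and return
    the conditional error rate of the rule fitted to each."""
    rng = random.Random(seed)
    mu0 = [0.0] * p
    # Hypothetical signal: only the first 10 features differ between classes.
    mu1 = [1.0 if j < 10 else 0.0 for j in range(p)]
    errors = []
    for _ in range(n_sets):
        train0 = [[rng.gauss(m, 1.0) for m in mu0] for _ in range(n_per_class)]
        train1 = [[rng.gauss(m, 1.0) for m in mu1] for _ in range(n_per_class)]
        errors.append(conditional_error(train0, train1, mu0, mu1))
    return errors


if __name__ == "__main__":
    for i, err in enumerate(simulate_conditional_errors()):
        print(f"training set {i}: conditional error = {err:.3f}")
```

Although the population is held fixed, each training set yields a different conditional error rate. This is the moving target that a confidence bound for the actual error rate has to cover.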

MeSH terms

  • Biomedical Research / methods
  • Computer Simulation
  • Confidence Intervals*
  • Data Interpretation, Statistical*
  • Humans
  • Models, Statistical*
  • Monte Carlo Method
  • Selection Bias