A method for constructing a confidence bound for the actual error rate of a prediction rule in high dimensions

Kevin K Dobbin

doi:10.1093/biostatistics/kxn035

Biostatistics. 2009 Apr;10(2):282-96. doi: 10.1093/biostatistics/kxn035. Epub 2008 Nov 27.

Author

Kevin K Dobbin¹

Affiliation

¹ National Cancer Institute, 6130 Executive Boulevard, EPN Room 8124, Rockville, MD 20892, USA. dobbinke@mail.nih.gov

Abstract

Constructing a confidence interval for the actual, conditional error rate of a prediction rule from multivariate data is problematic because this error rate is not a population parameter in the traditional sense: it is a functional of the training set. When the training set changes, so does this "parameter." A valid method for constructing confidence intervals for the actual error rate was previously developed by McLachlan. However, McLachlan's method cannot be applied in many cancer research settings because it requires the number of samples to be much larger than the number of dimensions (n >> p), and it assumes that no dimension-reducing feature selection step is performed. Here, an alternative to McLachlan's method is presented that can be applied when p >> n, with an additional adjustment in the presence of feature selection. Coverage probabilities of the new method are shown to be nominal or conservative over a wide range of scenarios. The new method is relatively simple to implement and not computationally burdensome.
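
The point that the actual error rate is a functional of the training set can be illustrated with a small simulation. The sketch below is not the method of the paper. It assumes a hypothetical setting chosen only for illustration: two equally likely Gaussian classes with identity covariance, p = 50 dimensions of which 5 are informative, 10 training samples per class, and a nearest-centroid rule. In that setting the conditional error rate of the trained rule has a closed form, so it can be computed exactly for each training set. All names and parameter values are assumptions.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sample_mean(rng, mu, n):
    """Mean vector of n independent draws from N(mu, I)."""
    draws = [[rng.gauss(m, 1.0) for m in mu] for _ in range(n)]
    return [sum(col) / n for col in zip(*draws)]


def actual_error_rate(seed, n_per_class=10, p=50, n_informative=5, shift=1.0):
    """Exact conditional error of a nearest-centroid rule trained on one
    simulated training set (illustrative assumptions, equal class priors)."""
    rng = random.Random(seed)
    mu0 = [0.0] * p
    mu1 = [shift] * n_informative + [0.0] * (p - n_informative)

    # Train: estimate the two class centroids from the training set.
    xbar0 = sample_mean(rng, mu0, n_per_class)
    xbar1 = sample_mean(rng, mu1, n_per_class)

    # The rule assigns class 1 when w . (x - mid) > 0.
    w = [b - a for a, b in zip(xbar0, xbar1)]
    mid = [(a + b) / 2.0 for a, b in zip(xbar0, xbar1)]
    norm_w = math.sqrt(sum(v * v for v in w))

    # For x ~ N(mu, I), w . (x - mid) is normal with mean w . (mu - mid)
    # and standard deviation ||w||, so each class error is a normal tail.
    score0 = sum(wi * (m - c) for wi, m, c in zip(w, mu0, mid)) / norm_w
    score1 = sum(wi * (m - c) for wi, m, c in zip(w, mu1, mid)) / norm_w
    err0 = normal_cdf(score0)    # class 0 sample assigned to class 1
    err1 = normal_cdf(-score1)   # class 1 sample assigned to class 0
    return 0.5 * (err0 + err1)


# Five independent training sets drawn from the same population.
errors = [actual_error_rate(seed) for seed in range(5)]
for seed, err in enumerate(errors):
    print(f"training set {seed}: actual error rate = {err:.3f}")
```

Each run of the loop draws a new training set from the same population and yields a different actual error rate, which is why an interval for this quantity must account for the training set at hand rather than target a single fixed population value.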

MeSH terms

Biomedical Research / methods
Computer Simulation
Confidence Intervals*
Data Interpretation, Statistical*
Humans
Models, Statistical*
Monte Carlo Method
Selection Bias