Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival

Hemant Ishwaran; Min Lu

doi:10.1002/sim.7803

Stat Med. 2019 Feb 20;38(4):558-582. doi: 10.1002/sim.7803. Epub 2018 Jun 4.

Authors

Hemant Ishwaran¹, Min Lu¹

Affiliation

¹ Division of Biostatistics, Miller School of Medicine, University of Miami, Miami, Florida, USA.

Abstract

Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While their widespread popularity stems from their prediction performance, an equally important feature is that they provide a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and to construct confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and in particular find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These 2 estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings.

Keywords: VIMP; bootstrap; delete-d jackknife; permutation importance; prediction error; subsampling.
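
The two variance estimators named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract gives no formulas: the sketch assumes the textbook forms of the subsampling estimator (centred at the mean of the subsampled statistics, scaled by b/n) and the delete-d jackknife estimator with d = n - b (centred at the full-sample statistic, scaled by b/(n - b)). The function name `subsample_variances`, its parameters, and the use of the sample mean as a stand-in for VIMP are hypothetical choices made for illustration; in practice `statistic` would refit a forest and return the VIMP of one variable. The normal-approximation interval at the end is likewise an assumption.

```python
import random
import statistics


def subsample_variances(data, statistic, b, n_subsamples=200, seed=0):
    """Return (subsampling, delete-d jackknife) variance estimates of statistic(data).

    Hypothetical helper: `statistic` stands in for a VIMP computation.
    """
    rng = random.Random(seed)
    n = len(data)
    if not 0 < b < n:
        raise ValueError("subsample size b must satisfy 0 < b < n")
    theta_full = statistic(data)
    # Recompute the statistic on subsamples of size b drawn without replacement.
    thetas = [statistic(rng.sample(data, b)) for _ in range(n_subsamples)]
    theta_bar = statistics.fmean(thetas)
    # Subsampling: centre at the mean of the subsampled statistics, rescale by b/n.
    v_sub = (b / n) * statistics.fmean([(t - theta_bar) ** 2 for t in thetas])
    # Delete-d jackknife (d = n - b): centre at the full-sample statistic,
    # rescale by b/(n - b).
    v_jack = (b / (n - b)) * statistics.fmean([(t - theta_full) ** 2 for t in thetas])
    return v_sub, v_jack


# Usage with a toy statistic (the sample mean) on simulated data.
data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
v_sub, v_jack = subsample_variances(data, statistics.fmean, b=50)

# Assumed normal-approximation 95% interval built from the jackknife variance.
z = statistics.NormalDist().inv_cdf(0.975)
theta = statistics.fmean(data)
ci = (theta - z * v_jack ** 0.5, theta + z * v_jack ** 0.5)
```

Under these assumed forms the jackknife estimate is never smaller than the subsampling estimate, because it adds the squared gap between the full-sample statistic and the average subsampled statistic and uses a larger scale factor. This may be the kind of bias correction the abstract alludes to, though the abstract itself does not spell it out.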

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Bias*
Confidence Intervals*
Data Interpretation, Statistical
Humans
Machine Learning*
Models, Statistical
Random Allocation
Regression Analysis*
Statistics as Topic*
