Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival

Stat Med. 2019 Feb 20;38(4):558-582. doi: 10.1002/sim.7803. Epub 2018 Jun 4.

Abstract

Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While their widespread popularity stems from their prediction performance, an equally important feature is that they provide a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and to construct confidence intervals. The method is general enough to apply to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and, in particular, find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates because of its bias correction properties. These two estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, which makes it especially attractive for big data settings.
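
The two subsample-based variance estimators named in the abstract can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: the formulas are the standard textbook forms of the subsampling and delete-d jackknife variance estimators (with d = n - b), the function name `subsample_variance_estimates` and its parameters are hypothetical, and the statistic is a plain sample mean standing in for VIMP so that the example runs with the Python standard library alone. For an actual forest, `statistic` would have to refit the forest on each subsample and return the VIMP of one variable.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def subsample_variance_estimates(
    data: Sequence[float],
    statistic: Callable[[Sequence[float]], float],
    b: int,
    n_subsamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (subsampling, delete-d jackknife) variance estimates.

    Hypothetical helper: assumes the standard forms
      subsampling: (b / n)       * mean_k (theta_k - theta_bar)^2
      delete-d:    (b / (n - b)) * mean_k (theta_k - theta_full)^2
    where theta_k is the statistic on the k-th size-b subsample drawn
    without replacement, theta_bar is the average of the theta_k, and
    theta_full is the statistic on all n observations.
    """
    n = len(data)
    if not 1 <= b < n:
        raise ValueError("b must satisfy 1 <= b < n")

    rng = random.Random(seed)
    pool = list(data)
    theta_full = statistic(pool)

    # Statistic recomputed on each subsample (drawn without replacement).
    thetas = [statistic(rng.sample(pool, b)) for _ in range(n_subsamples)]
    theta_bar = sum(thetas) / n_subsamples

    # Subsampling estimator: centered at the subsample average.
    subsampling = (b / n) * sum((t - theta_bar) ** 2 for t in thetas) / n_subsamples
    # Delete-d jackknife: centered at the full-sample statistic.
    jackknife = (b / (n - b)) * sum((t - theta_full) ** 2 for t in thetas) / n_subsamples
    return subsampling, jackknife


if __name__ == "__main__":
    gen = random.Random(1)
    sample = [gen.gauss(0.0, 1.0) for _ in range(200)]
    sub_var, jack_var = subsample_variance_estimates(
        sample, statistics.fmean, b=20, n_subsamples=3000, seed=2
    )
    # For the sample mean, the usual variance estimate is s^2 / n.
    print("reference s^2/n      :", statistics.variance(sample) / len(sample))
    print("subsampling estimate :", sub_var)
    print("delete-d jackknife   :", jack_var)
```

In this sketch the two estimators differ only in their centering and scaling: the jackknife centers at the full-sample statistic and rescales by b/(n - b) rather than b/n. A 95% normal-approximation interval would then be the full-sample statistic plus or minus 1.96 times the square root of either estimate, again as an assumption about how the intervals are formed rather than a detail given in the abstract.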

Keywords: VIMP; bootstrap; delete-d jackknife; permutation importance; prediction error; subsampling.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Bias*
  • Confidence Intervals*
  • Data Interpretation, Statistical
  • Humans
  • Machine Learning*
  • Models, Statistical
  • Random Allocation
  • Regression Analysis*
  • Statistics as Topic*