Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival
- PMID: 29869423
- PMCID: PMC6279615
- DOI: 10.1002/sim.7803
Abstract
Random forests are a popular nonparametric tree ensemble procedure with broad applications to data analysis. While their widespread popularity stems from their prediction performance, an equally important feature is that they provide a fully nonparametric measure of variable importance (VIMP). A current limitation of VIMP, however, is that no systematic method exists for estimating its variance. As a solution, we propose a subsampling approach that can be used to estimate the variance of VIMP and to construct confidence intervals. The method is general enough that it can be applied to many useful settings, including regression, classification, and survival problems. Using extensive simulations, we demonstrate the effectiveness of the subsampling estimator and in particular find that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties. These two estimators are highly competitive when compared with the .164 bootstrap estimator, a modified bootstrap procedure designed to deal with ties in out-of-sample data. Most importantly, subsampling is computationally fast, thus making it especially attractive for big data settings.
Keywords: VIMP; bootstrap; delete-d jackknife; permutation importance; prediction error; subsampling.
Copyright © 2018 John Wiley & Sons, Ltd.
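The abstract names two variance estimators but gives no formulas, so the following is only a minimal sketch of how they might be computed for a generic statistic. It assumes the standard textbook forms: the subsampling estimator centers the subsample statistics at their own mean and rescales by b/n, while the delete-d jackknife (d = n - b) centers them at the full-sample statistic and rescales by b/(n - b). The function names `subsample_variance` and `normal_ci`, the normal-approximation interval, and the default of 200 subsamples are illustrative assumptions, not the authors' implementation. The `statistic` argument stands in for a VIMP computation, which in practice would refit a forest on each subsample.

```python
import math
import random
from statistics import NormalDist, fmean


def subsample_variance(data, statistic, b, n_subsamples=200, seed=0):
    """Return (theta_n, v_sub, v_jack) for a statistic on a list `data`.

    Assumed forms (not taken from the abstract):
      v_sub  = (b / n)       * mean((theta_k - mean(theta))**2)
      v_jack = (b / (n - b)) * mean((theta_k - theta_n)**2)
    where theta_k is the statistic on the k-th subsample of size b.
    """
    n = len(data)
    if not 1 <= b < n:
        raise ValueError("subsample size b must satisfy 1 <= b < n")
    rng = random.Random(seed)

    # Statistic on the full sample; the jackknife is centered here.
    theta_n = statistic(data)

    # Draw subsamples of size b WITHOUT replacement and recompute.
    thetas = [statistic(rng.sample(data, b)) for _ in range(n_subsamples)]
    theta_bar = fmean(thetas)

    # Subsampling estimator: centered at the mean of the subsample values.
    v_sub = (b / n) * fmean((t - theta_bar) ** 2 for t in thetas)

    # Delete-d jackknife: centered at the full-sample value.
    v_jack = (b / (n - b)) * fmean((t - theta_n) ** 2 for t in thetas)
    return theta_n, v_sub, v_jack


def normal_ci(theta, variance, level=0.95):
    """Normal-approximation confidence interval (an assumed choice)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(variance)
    return theta - half_width, theta + half_width


if __name__ == "__main__":
    # Stand-in example: the sample mean replaces VIMP so the sketch runs
    # with the standard library only.
    rng = random.Random(1)
    sample = [rng.gauss(0.0, 1.0) for _ in range(400)]
    theta, v_sub, v_jack = subsample_variance(sample, fmean, b=40)
    print("estimate:", theta)
    print("subsampling variance:", v_sub)
    print("delete-d jackknife variance:", v_jack)
    print("95% CI (jackknife):", normal_ci(theta, v_jack))
```

To apply this to VIMP, `statistic` would be replaced by a hypothetical function that fits a random forest to the rows it receives and returns the importance of one variable. Because each refit uses only b of the n observations, the cost per replicate is lower than a full bootstrap refit, which is the computational advantage the abstract points to.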