On the overestimation of random forest's out-of-bag error
- PMID: 30080866
- PMCID: PMC6078316
- DOI: 10.1371/journal.pone.0201904
Abstract
The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors, it has been shown that the out-of-bag error can overestimate the true prediction error, depending on the choice of random forest parameters. Based on simulated and real data, this paper aims to identify settings in which this overestimation is likely. Moreover, because the overestimation is seen to depend on mtry, it is questionable whether the out-of-bag error can be used in classification tasks to select tuning parameters like mtry. The simulation-based and real-data-based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors, and weak effects. The overestimation had hardly any impact on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when the out-of-bag error was used for tuning parameter selection in the present studies, one cannot be sure that this applies to all future data. For settings with metric predictor variables, it is therefore strongly recommended to use stratified subsampling, with sampling fractions that are proportional to the class sizes, for both tuning parameter selection and error estimation in random forests. This yielded less biased estimates of the true prediction error. In unbalanced settings in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.
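The two sampling schemes named in the abstract, and the way an out-of-bag error is assembled from them, can be illustrated with a minimal sketch. This is not the authors' code. The function names (`stratified_subsample`, `oob_error`, `fit_stump`, `simulate`), the default sampling fraction of 0.632, and the simulated data are assumptions made for illustration, and `fit_stump` is a deliberately simple stand-in for a real classification tree that uses one randomly chosen predictor (loosely analogous to mtry = 1).

```python
import random
from collections import defaultdict


def stratified_subsample(labels, rng, fraction=0.632, per_class=None):
    """Draw in-bag indices without replacement, class by class.

    Default: every class contributes the same fraction of its observations
    (sampling fractions proportional to class sizes). If per_class is set,
    every class contributes that many observations (capped at its size).
    """
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    in_bag = []
    for members in by_class.values():
        if per_class is None:
            k = max(1, round(fraction * len(members)))
        else:
            k = min(per_class, len(members))
        in_bag.extend(rng.sample(members, k))
    return in_bag


def fit_stump(X, y, rng):
    """Hypothetical base learner: nearest class mean on one random predictor."""
    j = rng.randrange(len(X[0]))
    classes = sorted(set(y))
    means = {c: sum(x[j] for x, t in zip(X, y) if t == c) / y.count(c)
             for c in classes}
    return lambda x: min(classes, key=lambda c: abs(x[j] - means[c]))


def oob_error(X, y, fit, n_trees=200, fraction=0.632, per_class=None, seed=0):
    """Misclassification rate where each observation is predicted by majority
    vote of only those learners whose in-bag sample did not contain it."""
    rng = random.Random(seed)
    votes = [defaultdict(int) for _ in y]
    for _ in range(n_trees):
        in_bag = stratified_subsample(y, rng, fraction, per_class)
        in_bag_set = set(in_bag)
        predict = fit([X[i] for i in in_bag], [y[i] for i in in_bag], rng)
        for i in range(len(y)):
            if i not in in_bag_set:
                votes[i][predict(X[i])] += 1
    scored = [(max(v, key=v.get), y[i]) for i, v in enumerate(votes) if v]
    if not scored:
        raise ValueError("no observation was ever out-of-bag")
    return sum(pred != truth for pred, truth in scored) / len(scored)


def simulate(n_per_class, shift, n_features=3, seed=1):
    """Two balanced classes; class 1 is shifted by `shift` on every predictor."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * shift, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = simulate(n_per_class=50, shift=1.0)
    print("OOB error, proportional stratified subsampling:",
          oob_error(X, y, fit_stump))
    print("OOB error, equal number per class:",
          oob_error(X, y, fit_stump, per_class=20))
```

Because the in-bag sample is drawn class by class, its class proportions are fixed in every iteration. Passing `per_class` switches to the equal-number-per-class scheme that the abstract suggests for unbalanced settings.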
Conflict of interest statement
The authors have declared that no competing interests exist.
