PLoS One. 2018 Aug 6;13(8):e0201904. doi: 10.1371/journal.pone.0201904. eCollection 2018.

On the overestimation of random forest's out-of-bag error

Silke Janitza et al.

Abstract

The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag (OOB) error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters such as the number of candidate predictors randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the OOB error can overestimate the true prediction error, depending on the choice of random forest parameters. Based on simulated and real data, this paper aims to identify the settings in which this overestimation is likely. Moreover, because the overestimation depends on mtry, it is questionable whether the OOB error should be used to select tuning parameters such as mtry in classification tasks. The simulation and real-data studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, weak correlations between predictors, and weak effects. The overestimation had hardly any impact on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when the OOB error was used for tuning parameter selection in the present studies, one cannot be sure that this holds for all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling, with sampling fractions proportional to the class sizes, for both tuning parameter selection and error estimation in random forests; this yielded less biased estimates of the true prediction error. In unbalanced settings, where there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.
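To make the recommended remedy concrete, the following is a minimal Python sketch of a stratified OOB estimate, using scikit-learn rather than the authors' R setup; the sampling fraction 0.632, the tree count, and the data dimensions are illustrative assumptions, not the paper's exact configuration. Each tree is grown on a subsample drawn without replacement with class proportions matching the full sample, and each observation is classified by majority vote over only those trees for which it was out-of-bag.

```python
# Hedged sketch (scikit-learn, not the authors' R code) of a stratified
# OOB error: trees are grown on stratified subsamples drawn without
# replacement, and each observation is judged only by trees that did
# not see it during training.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def stratified_oob_error(X, y, n_trees=500, frac=0.632, seed=0):
    n = len(y)
    votes = np.zeros((n, 2), dtype=int)    # OOB vote counts per class
    for t in range(n_trees):
        # Subsample without replacement, preserving class proportions.
        idx = resample(np.arange(n), replace=False,
                       n_samples=int(frac * n), stratify=y,
                       random_state=seed + t)
        oob = np.setdiff1d(np.arange(n), idx)
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=seed + t)
        tree.fit(X[idx], y[idx])
        votes[oob, tree.predict(X[oob])] += 1
    return np.mean(votes.argmax(axis=1) != y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))             # small n, large p, null case
y = np.repeat([0, 1], 10)                  # balanced binary response
print("stratified OOB error:", stratified_oob_error(X, y))  # close to 0.5
```

Dropping the stratify argument gives the corresponding unstratified subsampling estimate, which is the comparison drawn throughout the figures below.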

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Error rate estimates for the binary null case study (balanced).
Shown are different error rate estimates for the setting with two response classes of equal size and without any predictors with effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, n, and numbers of predictors, p. The mean error rate over 500 repetitions was obtained for a range of mtry values. The vertical grey dashed line in each plot indicates the most commonly used default choice for mtry in classification tasks, that is, √p.
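As a rough illustration of the design behind this figure, here is a sketch under assumed parameter values, with scikit-learn's max_features standing in for mtry and a far smaller repetition count than the paper's 500:

```python
# Sketch of the Fig 1 design: balanced binary null case, OOB vs. test
# error averaged over repetitions for a grid of mtry values.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, p, reps = 20, 100, 20
mtry_grid = [1, 5, 10, 25, 50, 100]        # sqrt(p) = 10 is the default

for mtry in mtry_grid:
    oob, test = [], []
    for _ in range(reps):
        X = rng.normal(size=(n, p))        # null case: no signal
        y = np.repeat([0, 1], n // 2)      # balanced classes
        X_te = rng.normal(size=(2000, p))
        y_te = np.repeat([0, 1], 1000)
        rf = RandomForestClassifier(n_estimators=500, max_features=mtry,
                                    oob_score=True,
                                    random_state=1).fit(X, y)
        oob.append(1 - rf.oob_score_)      # OOB error estimate
        test.append(1 - rf.score(X_te, y_te))  # ~true error, 0.5 here
    print(f"mtry={mtry:3d}  OOB={np.mean(oob):.3f}  test={np.mean(test):.3f}")
```

In this small-n regime the OOB estimates typically exceed 0.5 while the test error stays near it, which is the overestimation the figure documents.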
Fig 2. Error rate estimates for the binary power case study (balanced).
Shown are different error rate estimates for the setting with two response classes of equal size and a mixture of predictors with and without effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, n, and numbers of predictors, p. The mean error rate over 500 repetitions was obtained for a range of mtry values. The vertical grey dashed line in each plot indicates the most commonly used default choice for mtry in classification tasks, that is, √p.
Fig 3. Error rate estimates for the binary null case study (unbalanced).
Shown are different error rate estimates for the setting with two response classes of unequal size (smaller class containing 30% of the observations) and without any predictors with effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, n, and numbers of predictors, p. The mean error rate over 500 repetitions was obtained for a range of mtry values. The vertical grey dashed line in each plot indicates the most commonly used default choice for mtry in classification tasks, that is, √p.
Fig 4. Error rate estimates for the binary power case study (unbalanced).
Shown are different error rate estimates for the setting with two response classes of unequal size (smaller class containing 30% of the observations) and a mixture of predictors with and without effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, n, and numbers of predictors, p. The mean error rate over 500 repetitions was obtained for a range of mtry values. The vertical grey dashed line in each plot indicates the most commonly used default choice for mtry in classification tasks, that is, √p.
Fig 5. Error rate estimates for the real data study.
Shown are different error rate estimates for six real data sets, each with two or three response classes of nearly equal size. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, n, and numbers of predictors, p. The mean error rate over 1000 repetitions was obtained for a range of mtry values. The vertical grey dashed line in each plot indicates the most commonly used default choice for mtry in classification tasks, that is, √p.
Fig 6. Error rate estimates for simulation studies with many predictors with effect and n = 20.
Shown are different error rate estimates for an additional simulation study with two response classes of equal size and many predictor variables with effect. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with sample size n = 20 and different numbers of predictors, p. The mean error rate over 500 repetitions was obtained for a range of mtry values. The vertical grey dashed line in each plot indicates the most commonly used default choice for mtry in classification tasks, that is, √p.
Fig 7. Class imbalance in subsamples drawn from a balanced original sample.
Distribution of the frequency of class 1 observations in subsamples of size ⌊0.632n⌋, randomly drawn from a balanced sample with a total of (a) n = 1000, (b) n = 100, and (c) n = 20 observations from classes 1 and 2.
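The widening spread across panels (a) to (c) follows from the hypergeometric distribution of the class 1 count in a without-replacement subsample, as this short sketch with the caption's sample sizes illustrates:

```python
# The class-1 count in a subsample of size floor(0.632*n), drawn
# without replacement from a balanced sample, is hypergeometric; its
# relative spread grows sharply as n shrinks.
import numpy as np
from scipy.stats import hypergeom

for n in (1000, 100, 20):
    m = int(np.floor(0.632 * n))            # subsample size
    # Population of n observations, n/2 of them class 1, m drawn.
    dist = hypergeom(M=n, n=n // 2, N=m)
    sd_frac = dist.std() / m                # sd of the class-1 fraction
    print(f"n={n:4d}: class-1 fraction in subsample = 0.5 +/- {sd_frac:.3f}")
```

This growing imbalance of the in-bag samples (bootstrap samples behave similarly) is the mechanism behind the OOB bias in small samples discussed in the paper.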
Fig 8. The trees' preference for predicting the larger class in dependence on mtry.
Fraction of class 1 (the minority class in the training sample) predictions obtained for balanced test samples with 5000 observations each from classes 1 and 2, and p = 100 (null case setting). Predictions were obtained by RFs with a specific mtry (x-axis). RFs were trained on n = 30 observations (10 from class 1 and 20 from class 2) with p = 100. Results are shown for 500 repetitions.
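A hedged sketch of this experiment in scikit-learn (single repetition and an illustrative mtry grid, whereas the paper averages over 500 repetitions):

```python
# Sketch of the Fig 8 experiment: an RF trained on an unbalanced null
# sample (10 vs. 20 observations) predicts a balanced test set; the
# share of minority-class predictions is tracked across mtry values.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
p = 100
X = rng.normal(size=(30, p))               # null case: no signal
y = np.repeat([0, 1], [10, 20])            # class 0 is the minority
X_test = rng.normal(size=(10000, p))       # balanced by construction

for mtry in (1, 10, 25, 50, 100):
    rf = RandomForestClassifier(n_estimators=500, max_features=mtry,
                                random_state=1).fit(X, y)
    share = np.mean(rf.predict(X_test) == 0)
    print(f"mtry={mtry:3d}: share of minority-class predictions = {share:.3f}")
```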
Fig 9. The trees' preference for predicting the larger class in dependence on mtry and the number of predictors.
Fraction of class 1 (the minority class in the training sample) predictions obtained for balanced test samples with 5000 observations from each of classes 1 and 2 (null case setting). Predictions were obtained by RFs with a specific mtry from a corresponding grid of mtry values ({1, 2, …, 10} for p = 10; {1, 10, 20, …, 100} for p = 100; {1, 100, 200, …, 1000} for p = 1000). RFs were trained on n = 30 observations (10 from class 1 and 20 from class 2) with p ∈ {10, 100, 1000}. The mean fractions over 500 repetitions are shown. The grey dots indicate the most commonly used default choices for mtry in classification tasks, that is, √p.
Fig 10. Error rate estimates for the real data null case study with correlations.
Shown are different error rate estimates for studies based on six real data sets with correlated predictors, each with two or three response classes of nearly equal size. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, n, and numbers of predictors, p. The mean error rate over 1000 repetitions was obtained for a range of mtry values. The vertical grey dashed line in each plot indicates the most commonly used default choice for mtry in classification tasks, that is, √p.
Fig 11. Error rate estimates for the real data null case study without correlations.
Shown are different error rate estimates for studies based on six real data sets with uncorrelated predictors, each with two or three response classes of nearly equal size. The error rate was estimated through the test error, the OOB error, the stratified OOB error, the CV error, and the stratified CV error for settings with different sample sizes, n, and numbers of predictors, p. The mean error rate over 1000 repetitions was obtained for a range of mtry values. The vertical grey dashed line in each plot indicates the most commonly used default choice for mtry in classification tasks, that is, √p.
Fig 12. The effect of the OOB error's bias on RF performance when the OOB error is used for mtry selection.
Performance of RF classifiers when mtry was selected based on the OOB error, the stratified OOB error, the unstratified CV error, and the stratified CV error for the additional simulation studies with many variables with effect. The performance of RF was measured using a large independent test data set.
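The tuning comparison can be sketched as follows; the logistic data-generating model, the mtry grid, and the sample sizes are illustrative assumptions rather than the paper's simulation design:

```python
# Sketch of the tuning comparison: choose mtry by minimizing either the
# OOB error or a stratified CV error, then score both choices on a
# large independent test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
p = 100
beta = np.r_[np.full(20, 0.5), np.zeros(p - 20)]   # 20 weak effects

def draw(m):
    X = rng.normal(size=(m, p))
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    return X, (rng.uniform(size=m) < prob).astype(int)

X, y = draw(60)                 # small training sample
X_test, y_test = draw(5000)     # large independent test sample

oob_err, cv_err = {}, {}
for mtry in (1, 10, 25, 50, 100):
    rf = RandomForestClassifier(n_estimators=500, max_features=mtry,
                                oob_score=True, random_state=1).fit(X, y)
    oob_err[mtry] = 1 - rf.oob_score_
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=500, max_features=mtry,
                               random_state=1),
        X, y, cv=StratifiedKFold(5))
    cv_err[mtry] = 1 - scores.mean()

for name, err in (("OOB", oob_err), ("stratified CV", cv_err)):
    best = min(err, key=err.get)
    rf = RandomForestClassifier(n_estimators=500, max_features=best,
                                random_state=1).fit(X, y)
    print(f"{name:13s}: mtry={best:3d}, "
          f"test error={1 - rf.score(X_test, y_test):.3f}")
```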

References

    1. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. doi:10.1023/A:1010933404324
    2. Bylander T. Estimating generalization error on two-class datasets using out-of-bag estimates. Mach Learn. 2002;48(1-3):287–297. doi:10.1023/A:1013964023376
    3. Zhang GY, Zhang CX, Zhang JS. Out-of-bag estimation of the optimal hyperparameter in SubBag ensemble method. Commun Stat Simul Comput. 2010;39(10):1877–1892. doi:10.1080/03610918.2010.521277
    4. Goldstein BA, Polley EC, Briggs F. Random forests for genetic association studies. Stat Appl Genet Mol Biol. 2011;10(1):1–34. doi:10.2202/1544-6115.1691
    5. Mitchell MW. Bias of the Random Forest out-of-bag (OOB) error for certain input parameters. Open J Stat. 2011;1(3):205–211. doi:10.4236/ojs.2011.13024

Grants and funding

SJ and RH were both funded by grant BO3139/6-1 from the German Science Foundation (URL http://www.dfg.de). SJ was in addition supported by grant BO3139/2-2 from the German Science Foundation. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
