Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis

J Clin Epidemiol. 1999 Oct;52(10):935-42. doi: 10.1016/s0895-4356(99)00103-1.


Stepwise selection methods are widely applied to identify covariables for inclusion in regression models. One of the problems of stepwise selection is biased estimation of the regression coefficients. We illustrate this "selection bias" with logistic regression in the GUSTO-I trial (40,830 patients with an acute myocardial infarction). Random samples were drawn that included 3, 5, 10, 20, or 40 events per variable (EPV). Backward stepwise selection was applied in models containing 8 or 16 pre-specified predictors of 30-day mortality. We found a considerable overestimation of regression coefficients of selected covariables. The selection bias decreased with increasing EPV. For EPV 3, 10, or 40, the bias exceeded 25% for 7, 3, and 1 of the covariables in the 8-predictor model, respectively, when a conventional selection criterion was used (alpha = 0.05). For these EPV values, the bias was less than 20% for all covariables when no selection was applied. We conclude that stepwise selection may result in a substantial bias of estimated regression coefficients.
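The mechanism described above can be illustrated with a minimal simulation sketch. It does not use the GUSTO-I data and is not the authors' code. It rests on these assumptions: synthetic standard-normal predictors, hypothetical true coefficients (0.8, 0.3, 0.3, 0), a sample size giving on the order of 10 events per variable, and backward elimination by the Wald test at alpha = 0.05. The function names fit_logistic, backward_select, and simulate are illustrative only.

```python
import numpy as np

Z_CRIT = 1.959964  # two-sided normal critical value for alpha = 0.05


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson.

    Returns the coefficient estimates and their Wald standard errors.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    # Recompute the information matrix at the final estimate.
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, np.sqrt(np.diag(np.linalg.inv(hessian)))


def backward_select(X, y):
    """Backward elimination: repeatedly drop the least significant predictor.

    Column 0 of X is the intercept and is always kept. Returns a dict
    mapping each retained column index to its estimate in the final model.
    """
    keep = list(range(1, X.shape[1]))
    while keep:
        beta, se = fit_logistic(X[:, [0] + keep], y)
        z = np.abs(beta[1:] / se[1:])
        worst = int(np.argmin(z))
        if z[worst] >= Z_CRIT:
            return dict(zip(keep, beta[1:]))
        keep.pop(worst)
    return {}


def simulate(n_sims=200, n=150, seed=0):
    """Compare full-model estimates with estimates conditional on selection."""
    rng = np.random.default_rng(seed)
    # Hypothetical true model: intercept followed by four predictor effects.
    true_beta = np.array([-1.0, 0.8, 0.3, 0.3, 0.0])
    k = len(true_beta) - 1
    full = np.zeros((n_sims, k))
    sel_sum = np.zeros(k)
    sel_count = np.zeros(k)
    for s in range(n_sims):
        X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])
        prob = 1.0 / (1.0 + np.exp(-X @ true_beta))
        y = (rng.random(n) < prob).astype(float)
        full[s] = fit_logistic(X, y)[0][1:]
        for j, b in backward_select(X, y).items():
            sel_sum[j - 1] += b
            sel_count[j - 1] += 1
    return {
        "true": true_beta[1:],
        "full_mean": full.mean(axis=0),
        # Mean estimate over the replications in which the predictor was kept.
        "selected_mean": sel_sum / np.maximum(sel_count, 1),
        "selected_frac": sel_count / n_sims,
    }


if __name__ == "__main__":
    for name, values in simulate().items():
        print(name, np.round(values, 3))
```

In this sketch a weak predictor survives elimination only when its estimate happens to be large relative to its standard error, so averaging over the replications in which it was selected tends to overstate the true coefficient, while the full-model average stays much closer to it. Increasing n (and hence EPV) should shrink the gap, in line with the trend reported above.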

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Aged
  • Clinical Trials as Topic
  • Female
  • Humans
  • Logistic Models*
  • Male
  • Myocardial Infarction / mortality*
  • Predictive Value of Tests
  • Risk Factors
  • Selection Bias*