Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets

Stat Med. 2000 Apr 30;19(8):1059-79. doi: 10.1002/(sici)1097-0258(20000430)19:8<1059::aid-sim412>;2-0.


Logistic regression analysis may well be used to develop a prognostic model for a dichotomous outcome. Especially when limited data are available, it is difficult to determine an appropriate selection of covariables for inclusion in such models. Also, predictions may be improved by applying some sort of shrinkage in the estimation of regression coefficients. In this study we compare the performance of several selection and shrinkage methods in small data sets of patients with acute myocardial infarction, where we aim to predict 30-day mortality. Selection methods included backward stepwise selection with significance levels alpha of 0.01, 0.05, 0. 157 (the AIC criterion) or 0.50, and the use of qualitative external information on the sign of regression coefficients in the model. Estimation methods included standard maximum likelihood, the use of a linear shrinkage factor, penalized maximum likelihood, the Lasso, or quantitative external information on univariable regression coefficients. We found that stepwise selection with a low alpha (for example, 0.05) led to a relatively poor model performance, when evaluated on independent data. Substantially better performance was obtained with full models with a limited number of important predictors, where regression coefficients were reduced with any of the shrinkage methods. Incorporation of external information for selection and estimation improved the stability and quality of the prognostic models. We therefore recommend shrinkage methods in full models including prespecified predictors and incorporation of external information, when prognostic models are constructed in small data sets.

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Age Factors
  • Aged
  • Female
  • Forecasting
  • Humans
  • Logistic Models*
  • Male
  • Models, Statistical*
  • Myocardial Infarction / mortality*
  • Prognosis
  • Regression Analysis*
  • Risk Factors
  • Sample Size
  • Sex Factors