Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure

J Clin Epidemiol. 2011 Sep;64(9):993-1000. doi: 10.1016/j.jclinepi.2010.11.012. Epub 2011 Mar 16.


Objective: Logistic regression is commonly used in health research, and it is important that its parameter estimates can be trusted. A common problem arises when the outcome has few events; parameter estimates may then be biased or unreliable. This study examined the relation between correctness of estimation and several data characteristics: number of events per variable (EPV), number of predictors, percentage of predictors that are highly correlated, percentage of predictors that are non-null, size of regression coefficients, and size of correlations.
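
To make the EPV quantity concrete, here is a minimal sketch of how it is typically computed: the number of outcome events divided by the number of candidate predictors. The function name events_per_variable is hypothetical, and counting only outcomes coded 1 as events is an assumption; some authors instead use the smaller of the event and non-event counts.

```python
def events_per_variable(outcomes, n_predictors):
    """Return events per variable (EPV) for a binary outcome.

    Assumption: an "event" is an outcome coded 1, and n_predictors counts
    every candidate predictor considered for the model.
    """
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    n_events = sum(1 for y in outcomes if y == 1)
    return n_events / n_predictors


# Illustrative data: 400 subjects, 40 events, 8 candidate predictors.
outcomes = [1] * 40 + [0] * 360
epv = events_per_variable(outcomes, n_predictors=8)
```

With 40 events and 8 predictors the ratio is 40 / 8, which falls below the conventional threshold of 10 that the abstract refers to.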

Study design: Simulation studies.
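
The abstract does not describe the simulation design in detail, so the following is only a rough sketch of what a study of this kind might look like, not the authors' procedure. It assumes standard-normal predictors with a common pairwise correlation rho (0 <= rho < 1), a Newton-Raphson maximum likelihood fit, and bias measured as the mean difference between estimated and true coefficients. All function and parameter names (simulate_data, fit_logistic, mean_bias, rho, n_reps) are hypothetical.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def simulate_data(n, betas, intercept, rho, rng):
    """Draw n rows of equicorrelated normal predictors and a binary outcome."""
    X, y = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # shared factor induces correlation rho
        row = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
               for _ in betas]
        p = sigmoid(intercept + sum(b * x for b, x in zip(betas, row)))
        X.append([1.0] + row)  # leading 1.0 is the intercept column
        y.append(1 if rng.random() < p else 0)
    return X, y


def solve(A, b):
    """Gaussian elimination with partial pivoting; None if A is singular."""
    k = len(b)
    M = [A[i][:] + [b[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-10:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        x[i] = (M[i][k] - sum(M[i][j] * x[j] for j in range(i + 1, k))) / M[i][i]
    return x


def fit_logistic(X, y, max_iter=25, tol=1e-8):
    """Newton-Raphson fit; returns coefficients, or None if it fails."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]  # X'WX, the observed information
        for row, yi in zip(X, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, row)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    info[i][j] += w * row[i] * row[j]
        step = solve(info, grad)
        if step is None:
            return None
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            return beta
    return None  # no convergence, e.g. under (quasi-)separation


def mean_bias(n, betas, intercept, rho, n_reps, seed=0):
    """Return (mean bias per slope, mean realized EPV, number of successful fits)."""
    rng = random.Random(seed)
    p = len(betas)
    bias_sum, epv_sum, fitted = [0.0] * p, 0.0, 0
    for _ in range(n_reps):
        X, y = simulate_data(n, betas, intercept, rho, rng)
        est = fit_logistic(X, y)
        if est is None:
            continue  # failed fits are skipped here; a real study would count them
        fitted += 1
        epv_sum += sum(y) / p
        for j in range(p):
            bias_sum[j] += est[j + 1] - betas[j]
    if fitted == 0:
        return None
    return [s / fitted for s in bias_sum], epv_sum / fitted, fitted
```

A call such as mean_bias(n=100, betas=[1.0, 1.0, 0.0, 0.0], intercept=-2.0, rho=0.8, n_reps=200) would then explore a low-EPV, high-correlation setting. Skipping non-converged fits is a simplification that can itself understate the problems at low EPV.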

Results: In many situations, logistic regression modeling may pose substantial problems even when EPV exceeds 10. Moreover, EPV is not the only factor that affects the correctness of parameter estimation. Large regression coefficients and high correlations between the predictors may cause serious problems in the estimation process. Finally, power is generally very low, even at 20 EPV.

Conclusion: There is no single rule based on EPV that would guarantee an accurate estimation of logistic regression parameters. Instead, the number of predictors, probable size of the regression coefficients based on previous literature, and correlations among the predictors must be taken into account as guidelines to determine the necessary sample size.

MeSH terms

  • Bias
  • Computer Simulation
  • Humans
  • Logistic Models
  • Research Design*
  • Statistics as Topic / methods*