Problems due to small samples and sparse data in conditional logistic regression analysis

Am J Epidemiol. 2000 Mar 1;151(5):531-9. doi: 10.1093/oxfordjournals.aje.a010240.

Conditional logistic regression was developed to avoid "sparse-data" biases that can arise in ordinary logistic regression analysis. Nonetheless, it is a large-sample method that can exhibit considerable bias when certain types of matched sets are infrequent or when the model contains too many parameters. Sparse-data bias can cause misleading inferences about confounding, effect modification, dose response, and induction periods, and can interact with other biases. In this paper, the authors describe these problems in the context of matched case-control analysis and provide examples from a study of electrical wiring and childhood leukemia and a study of diet and glioma. The same problems can arise in any likelihood-based analysis, including ordinary logistic regression. The problems can be detected by careful inspection of data and by examining the sensitivity of estimates to category boundaries, variables in the model, and transformations of those variables. One can also apply various bias corrections or turn to methods less sensitive to sparse data than conditional likelihood, such as Bayesian and empirical-Bayes (hierarchical regression) methods.
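
The sparse-data bias described above can be illustrated in the simplest matched design. The sketch below is not taken from the paper. It assumes 1:1 matched pairs and a single binary exposure, for which the conditional maximum-likelihood odds ratio reduces to the ratio of the two kinds of discordant pairs. The function names (matched_pair_or, estimate_distribution, prob_infinite, mean_finite_estimate) and the example odds ratio of 2.0 are hypothetical choices made for illustration. Given m discordant pairs, the number in which only the case is exposed is binomial with probability OR / (1 + OR), so the exact sampling distribution of the estimate can be enumerated without simulation.

```python
import math


def matched_pair_or(n10, n01):
    """Conditional ML odds ratio for 1:1 matched pairs with one binary exposure.

    n10: discordant pairs in which only the case is exposed.
    n01: discordant pairs in which only the control is exposed.
    """
    if n10 == 0 and n01 == 0:
        return math.nan   # no discordant pairs: the estimate is undefined
    if n01 == 0:
        return math.inf   # the estimate diverges
    return n10 / n01


def estimate_distribution(m, true_or):
    """Exact distribution of the estimate given m >= 1 discordant pairs.

    Returns a list of (estimate, probability) pairs, one per possible n10.
    """
    p = true_or / (1.0 + true_or)  # P(only the case is exposed | discordant pair)
    return [
        (matched_pair_or(x, m - x), math.comb(m, x) * p**x * (1.0 - p) ** (m - x))
        for x in range(m + 1)
    ]


def prob_infinite(m, true_or):
    """Probability that the estimate is infinite (no control-only-exposed pairs)."""
    return sum(pr for est, pr in estimate_distribution(m, true_or) if math.isinf(est))


def mean_finite_estimate(m, true_or):
    """Mean of the estimate, conditional on it being finite."""
    finite = [(est, pr) for est, pr in estimate_distribution(m, true_or)
              if math.isfinite(est)]
    total = sum(pr for _, pr in finite)
    return sum(est * pr for est, pr in finite) / total


if __name__ == "__main__":
    # Hypothetical true odds ratio of 2.0 at several numbers of discordant pairs.
    for m in (10, 50, 200):
        print(m, round(mean_finite_estimate(m, 2.0), 3), prob_infinite(m, 2.0))
```

Under these assumptions, the mean of the finite estimates lies above the true odds ratio when discordant pairs are few and approaches it as their number grows, which is the direction of bias (away from the null) that the abstract warns about. Real analyses with several covariates or variable matching ratios need a fitted conditional likelihood rather than this closed form.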

Publication types

  • Research Support, Non-U.S. Gov't
  • Review

MeSH terms

  • Bias*
  • Case-Control Studies
  • Central Nervous System Neoplasms / epidemiology
  • Central Nervous System Neoplasms / etiology
  • Child
  • Diet
  • Electromagnetic Fields / adverse effects
  • Epidemiologic Methods*
  • Glioma / epidemiology
  • Glioma / etiology
  • Humans
  • Leukemia / epidemiology
  • Leukemia / etiology
  • Likelihood Functions
  • Logistic Models*
  • Matched-Pair Analysis
  • Odds Ratio
  • Regression Analysis*
  • Risk Assessment