Avoiding sparse data bias: an example from gynecologic oncology

J Registry Manag. 2012 Winter;39(4):167-71.


Objective: The purpose of this study is to review the use of 3 statistical techniques that can be employed when analyzing sparse data.

Methods: A cross-sectional prevalence study was conducted using an incidence file from the Texas Cancer Registry covering the years 1995-2006. The records of women who were diagnosed with primary ovarian carcinosarcoma, a rare malignancy with poor survival, were extracted. The exposure variable was race: white patients were compared to black patients. The dichotomous outcome was the presence of distant metastasis at the time of diagnosis. Given the small sample size and the unbalanced nature of the outcome, we used SAS 9.3 software to perform the following 3 types of analyses as alternatives to ordinary logistic regression: Bayesian logistic regression (Monte Carlo sample size of 30,000), exact logistic regression, and logistic regression using penalized maximum likelihood estimation. The race odds ratios (OR) were adjusted for age.

Results: A total of 52 women with carcinosarcoma primary to the ovary were included (47 white, 5 black). The prevalence of distant metastasis was 66% and 60% in the white and black patients, respectively (crude OR, whites compared to blacks: 1.29). None of the adjusted ORs were statistically significant. The adjusted race OR from the Bayesian analysis (1.16) was closer to the null value of 1 than the ORs from the exact logistic model (1.24) and penalized model (1.31).
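
The crude OR can be checked by hand. The sketch below assumes that 31 of the 47 white patients and 3 of the 5 black patients had distant metastasis; these counts are not reported in the abstract and are inferred from the rounded prevalences (66% and 60%). It also computes a large-sample (Wald) 95% interval to show how little information 5 patients contribute. That interval is illustrative only and is not a result from the study.

```python
import math

# Assumed counts, inferred from the reported 66% of 47 and 60% of 5.
white_mets, white_total = 31, 47
black_mets, black_total = 3, 5

def crude_odds_ratio(a, n1, c, n0):
    """Odds of the outcome in group 1 (a of n1) over group 0 (c of n0)."""
    b, d = n1 - a, n0 - c
    return (a * d) / (b * c)

def wald_ci(a, n1, c, n0, z=1.96):
    """Large-sample 95% interval for the OR, built on the log scale."""
    b, d = n1 - a, n0 - c
    log_or = math.log(crude_odds_ratio(a, n1, c, n0))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

odds_ratio = crude_odds_ratio(white_mets, white_total, black_mets, black_total)
lower, upper = wald_ci(white_mets, white_total, black_mets, black_total)
print(f"crude OR = {odds_ratio:.2f}")
print(f"Wald 95% CI = ({lower:.2f}, {upper:.2f})")
```

Under these assumed counts the point estimate rounds to the reported 1.29, and the Wald interval extends well to either side of 1.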

Conclusions: The most common statistical tests and models encountered in clinical and public health research depend on "large-sample" approximations. However, there are situations in which the minimum number of subjects required is not reached and hence ordinary logistic regression is not appropriate. In these situations, it is beneficial to adopt an alternative strategy such as performing a Bayesian analysis, fitting an exact logistic regression model, or using penalized maximum likelihood estimation.
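
As a rough illustration of two of these alternatives, the sketch below applies them to the unadjusted 2x2 table only. It relies on assumed counts (31 of 47 white patients and 3 of 5 black patients with distant metastasis, inferred from the reported prevalences), so it cannot reproduce the age-adjusted ORs reported in the Results. For a model with a single binary covariate, a Firth-type penalized likelihood estimate reduces to adding 0.5 to each cell, and the exact conditional test of OR = 1 is Fisher's exact test. Whether the study's penalized model used the Firth penalty is an assumption here.

```python
from math import comb

# Assumed 2x2 table (not given in the abstract).
a, b = 31, 16   # white: with / without distant metastasis
c, d = 3, 2     # black: with / without distant metastasis

def penalized_or(a, b, c, d):
    """Firth-type estimate for one binary covariate: add 0.5 to every cell."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value.

    Sums the probabilities of all tables with the observed margins that are
    no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

print(f"penalized OR = {penalized_or(a, b, c, d):.2f}")
print(f"Fisher exact p = {fisher_exact_p(a, b, c, d):.3f}")
```

Neither function adjusts for age; a full analysis would fit the corresponding regression models in statistical software, as the study did.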

MeSH terms

  • Age Factors
  • Aged
  • Bayes Theorem
  • Black or African American / statistics & numerical data*
  • Carcinosarcoma / ethnology*
  • Carcinosarcoma / pathology
  • Carcinosarcoma / therapy
  • Cross-Sectional Studies
  • Female
  • Genital Neoplasms, Female / ethnology
  • Humans
  • Middle Aged
  • Neoplasm Metastasis
  • Ovarian Neoplasms / ethnology*
  • Ovarian Neoplasms / pathology
  • Ovarian Neoplasms / therapy
  • Prevalence
  • Registries / statistics & numerical data*
  • Texas / epidemiology
  • White People / statistics & numerical data*