J Vasc Surg. 2013 Nov;58(5):1353-1359.e6.
doi: 10.1016/j.jvs.2013.05.008. Epub 2013 Jul 2.

Comparative methods for handling missing data in large databases

Free article

Antonia J Henry et al. J Vasc Surg. 2013 Nov.
Abstract

Objective: Analysis of complex survey databases is an important tool for health services researchers. Missing data elements are challenging because the reasons for "missingness" are multifactorial, especially for categorical variables such as race. We simulated missing data for race and analyzed the bias of five missing data methods when predicting major amputation in patients with critical limb ischemia (CLI).

Methods: Patient discharges with fully observed data containing lower extremity revascularization or major amputation and CLI were selected from the 2003 to 2007 Nationwide Inpatient Sample, a complex survey database (weighted n = 684,057). Considering several random missing data schemes, we compared five missing data methods: complete case analysis, replacement with observed frequencies, missing indicator variable, multiple imputation, and reweighted estimating equations. We created 100 simulated data sets, with 5%, 15%, or 30% of subjects' race drawn to be missing from the full data set. Bias was estimated by comparing the estimated regression coefficients averaged over 100 simulated data sets (β(miss)) from each method vs estimates from the fully observed data set (β(full)), with relative bias calculated as [(β(full) - β(miss))/β(full)] × 100%.
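
The relative bias calculation described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy coefficient values, and the choice of a completely-at-random masking scheme are assumptions made for illustration, and the regression fitting step itself is omitted.

```python
import random
from statistics import mean


def relative_bias(beta_full, beta_miss):
    """Relative bias in percent: ((beta_full - beta_miss) / beta_full) * 100."""
    if beta_full == 0:
        raise ValueError("beta_full must be nonzero")
    return (beta_full - beta_miss) / beta_full * 100.0


def mask_at_random(values, fraction, rng):
    """Return a copy of `values` with `fraction` of entries set to None.

    Assumption: entries are masked completely at random; the abstract
    mentions several random missing data schemes without detailing them.
    """
    n_missing = round(len(values) * fraction)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


if __name__ == "__main__":
    rng = random.Random(0)

    # Hypothetical race categories for 100 subjects, 15% masked.
    race = ["A", "B", "C", "D"] * 25
    masked = mask_at_random(race, 0.15, rng)
    print(masked.count(None), "of", len(masked), "values masked")

    # Hypothetical coefficients: one from the fully observed data and
    # one per simulated data set (three here instead of 100).
    beta_full = 0.50
    beta_miss = mean([0.45, 0.47, 0.43])
    print(f"relative bias: {relative_bias(beta_full, beta_miss):.1f}%")
```

In a full simulation, each masked data set would be passed to one of the five missing data methods, the regression refit, and the resulting coefficients averaged before calling `relative_bias`.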

Results: Our results demonstrate that reweighted estimating equations produce the least biased coefficients and the missing indicator variable method produces the most biased coefficients. Complete case analysis, replacement with observed frequencies, and multiple imputation resulted in moderate bias. Sensitivity analysis demonstrated that the optimal method choice depends on the quantity and type of missing data encountered.

Conclusions: Missing data are an important analytic topic in research with large databases. The commonly used missing indicator variable method introduces severe bias and should be used with caution. We present empiric evidence to guide method selection for handling missing data.
