Using machine-learning algorithms to improve imputation in the Medical Expenditure Panel Survey

Health Serv Res. 2023 Apr;58(2):423-432. doi: 10.1111/1475-6773.14115. Epub 2022 Dec 25.


Objective: To assess the feasibility of applying machine learning (ML) methods to imputation in the Medical Expenditure Panel Survey (MEPS).

Data sources: All data come from the 2016-2017 MEPS.

Study design: Currently, expenditures for medical encounters in the MEPS are imputed with a predictive mean matching (PMM) algorithm in which a linear regression model is used to predict expenditures for events with (donors) and without (recipients) data. Recipient events and donor events are then matched based on the smallest distance between predicted expenditures, and the donor event's expenditures are used as the recipient event's imputation. We replace the linear regression algorithm in the PMM framework with ML methods to predict expenditures. We examine five alternatives to linear regression: Gradient Boosting, Random Forests, Extreme Random Forests, Deep Neural Networks, and a Stacked Ensemble approach. Additionally, we introduce an alternative matching scheme, which matches on a vector of predicted expenditures by sources of payment instead of a single total expenditure prediction to generate potentially superior matches.
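The PMM procedure described above can be illustrated with a minimal sketch. It is not the MEPS production code or the authors' implementation: the function names (`fit_simple_ols`, `nearest_donor`, `pmm_impute`), the single-predictor regression, the squared Euclidean distance, and the toy numbers are all assumptions made for illustration. The `predict` argument marks where an ML predictor could stand in for the linear regression, and `nearest_donor` accepts prediction vectors, so the same function covers matching on a single total and matching on predictions by source of payment.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Fit y = a + b*x by ordinary least squares (one predictor, for brevity)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda xi: intercept + slope * xi


def nearest_donor(recipient_pred, donor_preds):
    """Return the index of the donor whose prediction vector is closest.

    Distance is squared Euclidean (an assumption); ties go to the first donor.
    """
    def dist(donor_pred):
        return sum((r - d) ** 2 for r, d in zip(recipient_pred, donor_pred))

    return min(range(len(donor_preds)), key=lambda i: dist(donor_preds[i]))


def pmm_impute(donor_x, donor_y, recipient_x, predict=None):
    """Impute recipient expenditures with the observed value of the nearest donor.

    `predict` maps a covariate to a predicted expenditure; it defaults to an
    OLS fit on the donors, and any other fitted predictor can be passed in.
    """
    if predict is None:
        predict = fit_simple_ols(donor_x, donor_y)
    # Predictions are computed for donors and recipients alike.
    donor_preds = [(predict(x),) for x in donor_x]
    imputed = []
    for x in recipient_x:
        j = nearest_donor((predict(x),), donor_preds)
        imputed.append(donor_y[j])  # the donor's observed value, not the prediction
    return imputed


# Toy data (hypothetical): one covariate and observed expenditures for donors.
donor_x = [1.0, 2.0, 3.0, 4.0]
donor_y = [110.0, 190.0, 320.0, 380.0]
recipient_x = [2.2, 3.9]
imputed = pmm_impute(donor_x, donor_y, recipient_x)  # [190.0, 380.0]

# Vector matching: hypothetical predictions by two sources of payment.
# All three donors have the same predicted total (100), so matching on the
# total cannot tell them apart, while matching on the vector can.
donor_vectors = [(100.0, 0.0), (0.0, 100.0), (50.0, 50.0)]
best = nearest_donor((10.0, 90.0), donor_vectors)  # index 1
```

In the toy vector example, every donor has the same predicted total, so a match on the total alone falls back to the first donor, whereas the vector match selects the donor with a similar payment mix.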

Data collection: Study data are derived from a large federal survey.

Principal findings: ML algorithms perform better at both prediction and matching imputation than Ordinary Least Squares (OLS), the most common prediction algorithm used in PMM. On average, the Stacked Ensemble approach that combines all the ML algorithms performs best, improving expenditure prediction R² by 108% (0.156 points) and final imputation R² by 227% (0.397 points). Matching on a prediction vector also improves alignment of sources of payments between donor and recipient events.

Conclusions: ML algorithms and an alternative matching scheme improve the overall quality of expenditure PMM imputation in the MEPS. These methods may have additional value in other national surveys that currently rely on PMM or similar methods for imputation.

Keywords: MEPS; imputation; machine learning; medical expenditures; predictive mean matching.

MeSH terms

  • Algorithms*
  • Health Expenditures*
  • Humans
  • Linear Models
  • Machine Learning
  • Surveys and Questionnaires