Sequential BART for imputation of missing covariates

Biostatistics. 2016 Jul;17(3):589-602. doi: 10.1093/biostatistics/kxw009. Epub 2016 Mar 15.

Abstract

To conduct comparative effectiveness research using electronic health records (EHR), many covariates are typically needed to adjust for selection and confounding biases. Unfortunately, these covariates often contain missing values. Restricting the analysis to cases with complete covariates results in considerable efficiency losses and likely bias. Here, we assume the covariates are missing at random, with a missing data mechanism that may or may not depend on the response. Standard methods for multiple imputation can either fail to capture nonlinear relationships or suffer from incompatibility and uncongeniality issues. We explore a flexible Bayesian nonparametric approach to impute the missing covariates, which involves factoring the joint distribution of the covariates with missingness into a set of sequential conditionals and applying Bayesian additive regression trees (BART) to model each of these univariate conditionals. Using data augmentation, the posterior for each conditional can be sampled simultaneously. We provide details on the computational algorithm and make comparisons to other methods, including parametric sequential imputation and two versions of multiple imputation by chained equations. We illustrate the proposed approach on EHR data from an affiliated tertiary care institution to examine factors related to hyperglycemia.
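
The sequential factorization described in the abstract can be written out as a short sketch. The notation here is an assumption and does not come from the abstract: x_1, ..., x_p denote the covariates with missingness in some chosen order, z denotes the fully observed covariates, and the second line uses the standard sum-of-trees form of BART for a continuous covariate, with M trees T_{jm} and leaf parameters \mu_{jm}. When the missing data mechanism depends on the response, the response could presumably be added to the conditioning set as well; that detail is not stated in the abstract.

```latex
% Assumed notation: x_1..x_p have missing values, z is fully observed.
\begin{align}
  % Chain-rule factorization of the joint into univariate conditionals
  p(x_1, \dots, x_p \mid z)
    &= p(x_1 \mid z) \prod_{j=2}^{p} p(x_j \mid x_1, \dots, x_{j-1}, z), \\
  % Each conditional modeled by a sum of regression trees (continuous x_j)
  x_j &= \sum_{m=1}^{M} g\bigl(x_1, \dots, x_{j-1}, z;\; T_{jm}, \mu_{jm}\bigr)
        + \varepsilon_j,
  \qquad \varepsilon_j \sim N(0, \sigma_j^2).
\end{align}
```

Because each factor is a univariate conditional of a single joint distribution, the set of imputation models is coherent by construction. This is the property that chained-equations approaches, which specify each full conditional separately, do not guarantee.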

Keywords: Bayesian additive regression trees; Congenial models; Multiple imputation.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Bayes Theorem*
  • Data Interpretation, Statistical*
  • Electronic Health Records
  • Humans
  • Models, Statistical*
  • Regression Analysis*