Guided multiple imputation of missing data: using a subsample to strengthen the missing-at-random assumption

Epidemiology. 2007 Mar;18(2):246-52. doi: 10.1097/01.ede.0000254708.40228.8b.


Multiple imputation can be a good solution for handling missing data if the data are missing at random. However, this assumption is often difficult to verify. We describe an application of multiple imputation that makes this assumption plausible. The procedure requires contacting a random sample of subjects with incomplete data to fill in the missing information, and then adjusting the imputation model to incorporate the new data. Simulations with missing data that were decidedly not missing at random showed, as expected, that the method restored the original beta coefficients, whereas other methods of dealing with missing data failed. Using a dataset with real missing data, we found that different approaches to imputation produced moderately different results. Simulations suggest that filling in 10% of the data that were initially missing is sufficient for imputation in many epidemiologic applications and should produce approximately unbiased results, provided there is a high response on follow-up from the subsample of those with some originally missing data. This response can probably be achieved if the additional data collection is planned as an initial approach to dealing with the missing data, rather than at later stages, after further attempts have left only data that are very difficult to complete.
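
The idea of following up a random subsample can be illustrated with a small simulation. The sketch below is not the authors' imputation model: it is a deliberately simplified stand-in that estimates a mean rather than beta coefficients. It assumes a single standard-normal variable, a logistic missingness mechanism that depends on the variable's own value (so the data are not missing at random), full response from the followed-up subsample, and an approximate Bayesian bootstrap that uses the followed-up values as donors. The function name `simulate` and all parameter values are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def simulate(n=5000, followup_frac=0.10, n_imputations=20, seed=0):
    """Compare a complete-case mean with a subsample-guided imputed mean."""
    rng = random.Random(seed)

    # True values; in a real study these are unknown for subjects with missing data.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Assumed not-missing-at-random mechanism: larger values are more often missing.
    missing = [rng.random() < 1.0 / (1.0 + math.exp(-2.0 * xi)) for xi in x]

    observed = [xi for xi, m in zip(x, missing) if not m]
    missing_idx = [i for i, m in enumerate(missing) if m]

    # Follow up a random subsample of subjects with missing data
    # (assumes every contacted subject responds).
    k = max(2, round(followup_frac * len(missing_idx)))
    followed = rng.sample(missing_idx, k)
    donors = [x[i] for i in followed]
    n_still_missing = len(missing_idx) - k

    # Approximate Bayesian bootstrap: resample the donor pool, then draw
    # imputations for the remaining missing values from the resampled pool.
    estimates = []
    for _ in range(n_imputations):
        pool = rng.choices(donors, k=len(donors))
        imputed = rng.choices(pool, k=n_still_missing)
        completed = observed + donors + imputed
        estimates.append(statistics.fmean(completed))

    return {
        "true_mean": statistics.fmean(x),
        "complete_case_mean": statistics.fmean(observed),
        # Point estimates are averaged across imputations; variance
        # combination is omitted to keep the sketch short.
        "guided_mi_mean": statistics.fmean(estimates),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Because the donors are a random sample of the subjects with missing values, imputations drawn from them reflect the distribution among nonrespondents rather than among respondents, which is why the missing-at-random assumption becomes plausible within this design. Under the assumptions above, the complete-case mean should be pulled toward low values, while the guided estimate should lie much closer to the true mean.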

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Data Interpretation, Statistical*
  • Epidemiologic Studies
  • Humans
  • Models, Statistical*
  • Sampling Studies