Handling missing predictor values when validating and applying a prediction model to new patients

Jeroen Hoogland; Marit van Barreveld; Thomas P A Debray; Johannes B Reitsma; Tom E Verstraelen; Marcel G W Dijkgraaf; Aeilko H Zwinderman

doi:10.1002/sim.8682

Stat Med. 2020 Nov 10;39(25):3591-3607. doi: 10.1002/sim.8682. Epub 2020 Jul 20.

Authors

Jeroen Hoogland¹, Marit van Barreveld² ³, Thomas P A Debray¹ ⁴, Johannes B Reitsma¹ ⁴, Tom E Verstraelen³, Marcel G W Dijkgraaf², Aeilko H Zwinderman²

Affiliations

¹ Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands.
² Department of Clinical Epidemiology, Biostatistics, & Bioinformatics, Academic Medical Center, Amsterdam University Medical Centers, Amsterdam, The Netherlands.
³ Heart Center, Department of Cardiology, Amsterdam University Medical Centers, University of Amsterdam, Amsterdam, The Netherlands.
⁴ Cochrane Netherlands, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands.

Abstract

Missing data present challenges for development and real-world application of clinical prediction models. While these challenges have received considerable attention in the development setting, there is only sparse research on the handling of missing data in applied settings. The main unique feature of handling missing data in these settings is that missing data methods have to be performed for a single new individual, precluding direct application of mainstay methods used during model development. Correspondingly, we propose that it is desirable to perform model validation using missing data methods that transfer to practice in single new patients. This article compares existing and new methods to account for missing data for a new individual in the context of prediction. These methods are based on (i) submodels based on observed data only, (ii) marginalization over the missing variables, or (iii) imputation based on fully conditional specification (also known as chained equations). They were compared in an internal validation setting to highlight the use of missing data methods that transfer to practice while validating a model. As a reference, they were compared to the use of multiple imputation by chained equations in a set of test patients, because this has been used in validation studies in the past. The methods were evaluated in a simulation study where performance was measured by means of optimism corrected C-statistic and mean squared prediction error. Furthermore, they were applied in data from a large Dutch cohort of prophylactic implantable cardioverter defibrillator patients.
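To make the single-patient setting concrete, the following is a minimal sketch, not the authors' implementation, of how a risk might be computed for one new patient whose second predictor is missing. It contrasts marginalization over the missing variable (approach ii) with a simpler plug-in of the conditional mean, which is not the chained-equations procedure of approach (iii). The bivariate normal distribution for the predictors, the logistic model, and all numeric values are assumptions chosen purely for illustration; in practice the distribution parameters would have to be estimated from development data and stored alongside the model.

```python
import math
import random

def predict_risk(x1, x2, intercept, b1, b2):
    """Logistic prediction model with two predictors (hypothetical coefficients)."""
    return 1.0 / (1.0 + math.exp(-(intercept + b1 * x1 + b2 * x2)))

def conditional_x2(x1, mu1, mu2, sd1, sd2, rho):
    """Mean and SD of x2 given x1 under an assumed bivariate normal distribution."""
    cond_mean = mu2 + rho * sd2 / sd1 * (x1 - mu1)
    cond_sd = sd2 * math.sqrt(1.0 - rho ** 2)
    return cond_mean, cond_sd

def risk_conditional_mean(x1, dist, model):
    """Plug the conditional mean of the missing x2 into the full model."""
    cond_mean, _ = conditional_x2(x1, **dist)
    return predict_risk(x1, cond_mean, **model)

def risk_marginalized(x1, dist, model, n_draws=20000, seed=1):
    """Average the predicted risk over draws of x2 given the observed x1."""
    cond_mean, cond_sd = conditional_x2(x1, **dist)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        total += predict_risk(x1, rng.gauss(cond_mean, cond_sd), **model)
    return total / n_draws

# Hypothetical predictor distribution and model, for illustration only.
dist = dict(mu1=0.0, mu2=0.0, sd1=1.0, sd2=1.0, rho=0.5)
model = dict(intercept=-1.0, b1=0.8, b2=-0.5)

# New patient: x1 observed at 1.0, x2 missing.
risk_plugin = risk_conditional_mean(1.0, dist, model)
risk_marginal = risk_marginalized(1.0, dist, model)
```

Because the logistic function is nonlinear, the two risks generally differ: plugging in the conditional mean ignores the remaining uncertainty about the missing predictor, whereas marginalization averages over it. Both require only quantities that can be stored at development time, which is what makes them usable for a single new patient.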

Keywords: clinical prediction modeling; missing data; real-world application; validation.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cohort Studies
Computer Simulation*
Humans