Pac Symp Biocomput. 2017;22:207-218.
doi: 10.1142/9789813207813_0021.

MISSING DATA IMPUTATION IN THE ELECTRONIC HEALTH RECORD USING DEEPLY LEARNED AUTOENCODERS

Free PMC article

Brett K Beaulieu-Jones et al. Pac Symp Biocomput. 2017.

Abstract

Electronic health records (EHRs) have become a vital source of patient outcome data, but the widespread prevalence of missing data presents a major challenge. Different causes of missing data in the EHR may introduce unintentional bias. Here, we compare the effectiveness of popular multiple imputation strategies with a deeply learned autoencoder using the Pooled Resource Open-Access ALS Clinical Trials Database (PRO-ACT). To evaluate performance, we examined imputation accuracy for known values simulated to be either missing completely at random or missing not at random. We also compared ALS disease progression prediction across different imputation models. Autoencoders showed strong performance for imputation accuracy and contributed to the strongest disease progression predictor. Finally, we show that despite clinical heterogeneity, ALS disease progression appears homogeneous, with time from onset being the most important predictor.
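
The evaluation idea in the abstract (hide known values, impute them, and score the imputations against the hidden truth) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the synthetic Gaussian data, the column-mean imputer, the 20% masking fraction, and the helper names `spike_in_mcar`, `impute_column_mean`, and `masked_rmse` are all assumptions made for the example.

```python
import math
import random


def spike_in_mcar(rows, fraction, rng):
    """Copy `rows` and hide `fraction` of all cells uniformly at random (MCAR)."""
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    masked = rng.sample(cells, int(round(fraction * len(cells))))
    corrupted = [list(row) for row in rows]
    for i, j in masked:
        corrupted[i][j] = None  # None marks a spiked-in missing value
    return corrupted, masked


def impute_column_mean(rows):
    """Fill each missing cell with the mean of the observed values in its column."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def masked_rmse(truth, imputed, masked):
    """Root mean squared error computed only over the cells that were hidden."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in masked]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)
# Synthetic stand-in for a complete patient-by-feature table (hypothetical data).
truth = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
corrupted, masked = spike_in_mcar(truth, 0.2, rng)
imputed = impute_column_mean(corrupted)
rmse = masked_rmse(truth, imputed, masked)
print(f"RMSE on masked cells: {rmse:.3f}")
```

Any other imputer can be swapped in for `impute_column_mean` and scored on the same `masked` cells, which is what makes the comparison across strategies fair. Simulating data that are missing not at random would need a masking rule that depends on the values themselves, which this sketch does not attempt.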

Figures

Figure 1
Schematic structure of the autoencoder used for evaluations, with two hidden layers and 20% dropout between each layer.
Figure 2
Evaluation outline. (a) Imputation evaluation. Known values in the PRO-ACT patient data of 10,723 subjects are masked with spiked-in missing data. Imputation strategies are performed in parallel, and the RMSE is calculated between the masked input data and each strategy's imputations. (b) Progression prediction. PRO-ACT patient data are imputed using each strategy, and ten-fold cross-validation of a random forest regressor is performed on the imputed patients.
Figure 3
Histogram distribution and rug plot showing the number of patients each feature is present in. (a) The number of features each patient has. Ticks at the bottom indicate one patient with the count of features, bins indicate the number of patients in a range. (b) The number of patients having a recorded value for each feature. Ticks at the bottom indicate the number of patients a feature is present in, bins indicate the number of features in a range.
Figure 4
Effect of the amount of spiked-in missing data on imputation. Error bars indicate the range of scores across 5-fold cross-validation.
Figure 5
Effect of non-random spiked-in missing data on imputation (measured in root mean squared error). Autoencoder: autoencoder with dropout (2 layers, 500 nodes each); SVD: SVDImpute with rank of 40; KNN: KNNimpute with 7 neighbors; Mean: column mean averaging; Median: column median averaging; SI: SoftImpute.
Figure 6
ALS Functional Rating Scale prediction accuracy shown for an autoencoder, k-nearest neighbors, mean averaging, median averaging, the raw input including missing values, soft impute and singular value decomposition. The box indicates inner quartiles with the line representing the median; the whiskers indicate outer quartiles excluding outliers.
Figure 7
Prediction feature importance. (a) Importance levels of the top 10 features to the random forest regressor with autoencoder imputed data. (b) Histogram distribution of patient ALSFRS slope levels.
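
The architecture described in the captions for Figures 1 and 5 (two hidden layers of 500 nodes each, with 20% dropout between layers) can be sketched as a forward pass in NumPy. This is an illustration under stated assumptions, not the authors' implementation: the ReLU activations, the small random initialization, the zero-filling of missing inputs, the loss restricted to observed entries, and the 30-feature synthetic input are all choices made for this example, and no training loop is shown.

```python
import numpy as np


def init_params(n_features, n_hidden, rng):
    """Weights for input -> hidden -> hidden -> output (two hidden layers)."""
    sizes = [n_features, n_hidden, n_hidden, n_features]
    return [(rng.normal(0.0, 0.01, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def dropout(x, rate, rng, training):
    """Inverted dropout: zero a `rate` fraction of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


def forward(params, x, rng, rate=0.2, training=True):
    """Apply dropout before every dense layer; ReLU on hidden layers (assumed)."""
    h = x
    for k, (w, b) in enumerate(params):
        h = dropout(h, rate, rng, training) @ w + b
        if k < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h


def masked_mse(pred, target, observed):
    """Mean squared error over observed entries only (an assumed training loss)."""
    diff = (pred - target) * observed
    return float((diff ** 2).sum() / observed.sum())


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 30))               # hypothetical patients x features
observed = rng.random(x.shape) > 0.3       # True where a value is recorded
x_in = np.where(observed, x, 0.0)          # zero-fill missing inputs (assumption)
params = init_params(n_features=30, n_hidden=500, rng=rng)
recon = forward(params, x_in, rng, training=False)
loss = masked_mse(recon, x, observed)
```

After training, the reconstruction at the unobserved positions would serve as the imputed values. With the untrained weights used here, `recon` only demonstrates the shapes and the data flow.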
