The effects of data sources, cohort selection, and outcome definition on a predictive model of risk of thirty-day hospital readmissions

Colin Walsh; George Hripcsak

doi:10.1016/j.jbi.2014.08.006

J Biomed Inform. 2014 Dec:52:418-26. doi: 10.1016/j.jbi.2014.08.006. Epub 2014 Aug 23.

Authors

Colin Walsh¹, George Hripcsak²

Affiliations

¹ Department of Biomedical Informatics, Columbia University, United States; Department of Medicine, Columbia University, United States. Electronic address: cgw2106@columbia.edu.
² Department of Biomedical Informatics, Columbia University, United States.

Abstract

Background: Hospital readmission risk prediction remains an active area of investigation and operational interest in light of the Hospital Readmissions Reduction Program administered by the Centers for Medicare & Medicaid Services (CMS). Multiple risk models have been reported with variable discriminatory performance, and it remains unclear how design factors affect performance.

Objectives: To study the effects of varying three factors of model development in the prediction of readmission risk from health record data: (1) reason for readmission (primary readmission diagnosis); (2) available data and data types (e.g., visit history, laboratory results); (3) cohort selection.

Methods: Regularized regression (LASSO) was used to generate predictions of readmission risk using prevalence sampling. A support vector machine (SVM) was used for comparison in cohort selection testing. Models were calibrated by refitting to outcome prevalence.
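As an illustration of the regularized-regression idea named in the Methods, the following is a minimal sketch of L1-penalized (LASSO) logistic regression fitted by proximal gradient descent. It is not the authors' implementation: the function names (`fit_l1_logistic`, `soft_threshold`), the learning rate, the iteration count, and the toy data are all assumptions made for illustration, and the abstract does not say which solver or penalty strength was used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, lam, lr=0.1, n_iter=2000):
    """Fit logistic regression with an L1 penalty of strength `lam`.

    Hypothetical helper: plain proximal gradient descent (ISTA).
    The intercept `b` is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(n_iter):
        # Predicted probabilities and residuals under the current model.
        p = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) for row in X]
        resid = [pi - yi for pi, yi in zip(p, y)]
        # Gradient of the mean logistic loss.
        grad_b = sum(resid) / n
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(d)]
        # Gradient step, then soft-thresholding on the weights only.
        b -= lr * grad_b
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


# Toy data (made up): column 0 tracks the outcome, column 1 is unrelated to it.
X = [[-2.0, 1.0], [-1.5, -1.0], [-1.0, -1.0], [-0.5, 1.0],
     [0.5, 1.0], [1.0, -1.0], [1.5, -1.0], [2.0, 1.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_l1_logistic(X, y, lam=0.05)
```

On this toy data the penalty keeps a nonzero weight on the informative column and sets the weight on the unrelated column to exactly zero. That sparsity is what makes LASSO useful for selecting among many candidate predictors.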

Results: Predicting readmission risk across multiple reasons for readmission resulted in ROC areas ranging from 0.92 for readmission for congestive heart failure to 0.71 for syncope and 0.68 for all-cause readmission. Visit history and laboratory tests contributed the most predictive value; contributions varied by readmission diagnosis. Cohort definition affected performance for both parametric and nonparametric algorithms. Compared to all patients, limiting the cohort to patients whose index admission and readmission diagnoses matched resulted in a decrease in average ROC area from 0.78 to 0.55 (difference in ROC area 0.23, p = 0.01). Calibration plots demonstrated good calibration with low mean squared error.
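For readers unfamiliar with the metrics reported above, the sketch below shows one common way to compute an ROC area and a mean squared error from predicted risks. Both helpers (`roc_auc`, `mean_squared_error`) are hypothetical and use made-up data. The abstract does not specify how its mean squared error was computed, so the per-patient squared error shown here (a Brier-style score) is only one possible reading.

```python
def roc_auc(labels, scores):
    """ROC area as the probability that a randomly chosen positive case
    is scored higher than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_squared_error(labels, scores):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((s - l) ** 2 for l, s in zip(labels, scores)) / len(labels)


# Made-up example: 1 = readmitted within thirty days, 0 = not readmitted.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)            # 3 of 4 positive/negative pairs ranked correctly
mse = mean_squared_error(labels, scores)
```

An ROC area of 0.5 corresponds to ranking cases no better than chance, which is why a drop from 0.78 to 0.55 represents a near-total loss of discrimination.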

Conclusion: Targeting reason for readmission in risk prediction impacted discriminatory performance. In general, laboratory data and visit history data contributed the most to prediction; data source contributions varied by reason for readmission. Cohort selection had a large impact on model performance, and these results demonstrate the difficulty of comparing results across different studies of predictive risk modeling.

Keywords: Electronic health record; Predictive analytics; Readmissions; Regularized Logistic Regression; Risk modeling; Text mining.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Adolescent
Adult
Aged
Data Mining / methods*
Electronic Health Records / statistics & numerical data*
Female
Humans
Male
Middle Aged
Models, Statistical*
Patient Readmission / statistics & numerical data*
Reproducibility of Results
Research Design / statistics & numerical data*
Retrospective Studies
Risk Assessment
Young Adult
