The effects of data sources, cohort selection, and outcome definition on a predictive model of risk of thirty-day hospital readmissions

J Biomed Inform. 2014 Dec;52:418-26. doi: 10.1016/j.jbi.2014.08.006. Epub 2014 Aug 23.


Background: Hospital readmission risk prediction remains an active area of research and operational interest in light of the Hospital Readmissions Reduction Program administered by the Centers for Medicare & Medicaid Services (CMS). Multiple risk models have been reported with variable discriminatory performance, and it remains unclear how design factors affect that performance.

Objectives: To study the effects of varying three factors of model development when predicting readmission risk from health record data: (1) reason for readmission (primary readmission diagnosis); (2) available data and data types (e.g., visit history and laboratory results); and (3) cohort selection.

Methods: Regularized regression (LASSO) was used to generate predictions of readmission risk using prevalence sampling. A support vector machine (SVM) was used for comparison in cohort selection testing. Models were calibrated by refitting to the outcome prevalence.
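
The abstract does not spell out the calibration procedure. As a rough illustration of why outcome prevalence matters, the sketch below applies a standard intercept (prior) correction to a predicted probability. This is a swapped-in technique, not the authors' refitting step: it assumes a logistic model trained on a sample whose outcome prevalence differs from that of the target population. The function name `recalibrate_to_prevalence` and its parameters are hypothetical.

```python
import math


def recalibrate_to_prevalence(p_sampled, sample_prev, target_prev):
    """Shift a predicted probability from the training-sample prevalence
    to the target-population prevalence by adjusting the log-odds.

    Assumption: the model is logistic and only its intercept is biased
    by the sampling scheme (prior correction). This is NOT the refitting
    procedure described in the abstract.
    """
    for name, value in (("p_sampled", p_sampled),
                        ("sample_prev", sample_prev),
                        ("target_prev", target_prev)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")

    # Log-odds predicted under the sampled (e.g. balanced) prevalence.
    logit = math.log(p_sampled / (1.0 - p_sampled))

    # Difference in log prior odds between the sample and the population.
    offset = math.log(sample_prev / (1.0 - sample_prev)) \
        - math.log(target_prev / (1.0 - target_prev))

    # Remove the offset and map back to a probability.
    return 1.0 / (1.0 + math.exp(-(logit - offset)))


# Hypothetical numbers: a model trained on a balanced sample (50% readmitted)
# predicts 0.5 for a patient; the population readmission rate is taken as 10%.
adjusted = recalibrate_to_prevalence(0.5, sample_prev=0.5, target_prev=0.1)
```

With these illustrative inputs the adjusted risk falls to the assumed population rate, which is the expected behavior for a patient the model scores at exactly the sample average.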

Results: Predicting readmission risk across multiple reasons for readmission resulted in areas under the ROC curve ranging from 0.92 for readmission for congestive heart failure to 0.71 for syncope and 0.68 for all-cause readmission. Visit history and laboratory tests contributed the most predictive value; contributions varied by readmission diagnosis. Cohort definition affected performance for both parametric and nonparametric algorithms. Compared to using all patients, limiting the cohort to patients whose index admission and readmission diagnoses matched decreased the average area under the ROC curve from 0.78 to 0.55 (difference 0.23, p = 0.01). Calibration plots demonstrated good calibration with low mean squared error.
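
For readers less familiar with the discrimination metric reported above, the area under the ROC curve can be read as the probability that a randomly chosen readmitted patient receives a higher predicted risk than a randomly chosen non-readmitted patient. The sketch below computes it by direct pairwise comparison. The labels and scores are made-up toy values, not data from the study, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve by pairwise comparison.

    labels: 1 for readmitted, 0 for not readmitted.
    scores: predicted risk, higher meaning more likely to be readmitted.
    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (illustrative values only).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The pairwise form is quadratic in the number of patients, so it suits small illustrations; a rank-based computation gives the same value more efficiently on large cohorts.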

Conclusion: Targeting the reason for readmission in risk prediction affected discriminatory performance. In general, laboratory data and visit history data contributed the most to prediction; data source contributions varied by reason for readmission. Cohort selection had a large impact on model performance, and these results demonstrate the difficulty of comparing results across different studies of predictive risk modeling.

Keywords: Electronic health record; Predictive analytics; Readmissions; Regularized logistic regression; Risk modeling; Text mining.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Adolescent
  • Adult
  • Aged
  • Data Mining / methods*
  • Electronic Health Records / statistics & numerical data*
  • Female
  • Humans
  • Male
  • Middle Aged
  • Models, Statistical*
  • Patient Readmission / statistics & numerical data*
  • Reproducibility of Results
  • Research Design / statistics & numerical data*
  • Retrospective Studies
  • Risk Assessment
  • Young Adult