Automated data extraction and ensemble methods for predictive modeling of breast cancer outcomes after radiation therapy

Med Phys. 2019 Feb;46(2):1054-1063. doi: 10.1002/mp.13314. Epub 2018 Dec 28.

Abstract

Purpose: To compare the effectiveness of ensemble methods (e.g., random forests) and single-model methods (e.g., logistic regression and decision trees) in predicting treatment failure and adverse events (AEs) after radiotherapy (RT) in breast cancer patients, using data extracted automatically from electronic medical records (EMRs).

Methods: Data from 1967 consecutive breast radiotherapy (RT) courses delivered at one institution between 2008 and 2015 were extracted automatically from electronic medical records (EMRs) and oncology information systems using extraction software. More than 230 variables were extracted, grouped into the following variable segments: patient demographics, medical/surgical history, tumor characteristics, RT treatment history, and AEs tracked using CTCAE v4.0. Treatment failure was identified algorithmically by searching posttreatment encounters for evidence of local, nodal, or distant failure. Individual models were trained using decision trees, logistic regression, random forests, and boosted decision trees to predict treatment failures and AEs. Models were fit on 75% of the data and evaluated on the remaining 25% for probability calibration and area under the ROC curve (AUC). The impact of each variable segment was assessed by retraining without that segment and measuring the change in AUC (ΔAUC).
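The segment-ablation procedure described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not say how the 75% training split was drawn, so a simple random split is assumed, and ΔAUC is taken as AUC without the segment minus AUC with all segments, which matches the negative values reported. The names `delta_auc_by_segment`, `FitPredict`, and `mean_difference_model` are hypothetical, the toy model merely stands in for the decision trees, logistic regression, and ensembles used in the study, and the data are synthetic.

```python
import random
from typing import Callable, Dict, List, Sequence

Matrix = List[List[float]]
# Hypothetical model interface: fit on (X_train, y_train), return scores for X_test.
FitPredict = Callable[[Matrix, List[int], Matrix], List[float]]


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def delta_auc_by_segment(
    fit_predict: FitPredict,
    X: Matrix,
    y: List[int],
    segments: Dict[str, List[int]],
    seed: int = 0,
) -> Dict[str, float]:
    """Retrain without each segment; report AUC(without segment) - AUC(all segments)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # assumption: simple random 75/25 split
    cut = int(0.75 * len(idx))
    train, test = idx[:cut], idx[cut:]
    y_train = [y[i] for i in train]
    y_test = [y[i] for i in test]

    def test_auc(cols: List[int]) -> float:
        X_train = [[X[i][c] for c in cols] for i in train]
        X_test = [[X[i][c] for c in cols] for i in test]
        return auc(y_test, fit_predict(X_train, y_train, X_test))

    all_cols = sorted(c for cols in segments.values() for c in cols)
    full = test_auc(all_cols)
    return {
        name: test_auc([c for c in all_cols if c not in cols]) - full
        for name, cols in segments.items()
    }


def mean_difference_model(X_train: Matrix, y_train: List[int], X_test: Matrix) -> List[float]:
    """Toy stand-in model: weight each column by its difference in class means."""
    def col_mean(c: int, label: int) -> float:
        vals = [row[c] for row, lab in zip(X_train, y_train) if lab == label]
        return sum(vals) / len(vals)

    w = [col_mean(c, 1) - col_mean(c, 0) for c in range(len(X_train[0]))]
    return [sum(wi * xi for wi, xi in zip(w, row)) for row in X_test]


# Synthetic toy data (not from the study): column 0 carries signal, column 1 is noise.
rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[float(label), rng.random()] for label in y]
deltas = delta_auc_by_segment(mean_difference_model, X, y, {"signal": [0], "noise": [1]})
print(deltas)
```

In this toy setup, dropping the informative segment lowers the test AUC (negative ΔAUC), while dropping the noise segment leaves it unchanged. Any model that exposes the same fit-then-score interface could be passed in place of the stand-in.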

Results: All AUC values were statistically significant (P < 0.05). Ensemble methods outperformed single-model methods across all outcomes. The best ensemble method exceeded decision trees and logistic regression by an average of 0.053 and 0.034 in AUC, respectively. Calibration curves showed that model probabilities were well calibrated. Excluding the patient medical history variable segment led to the largest AUC reduction in all models (average ΔAUC = -0.025), followed by RT treatment history (-0.021) and tumor information (-0.015).
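For readers unfamiliar with calibration curves, the following is a minimal sketch of how one is commonly computed: predictions are grouped into equal-width probability bins, and the mean predicted probability in each bin is compared with the observed event rate. The abstract does not give the binning scheme, so equal-width bins are an assumption. The function name `calibration_curve_points` is hypothetical and the data are synthetic.

```python
from typing import List, Sequence, Tuple


def calibration_curve_points(
    y_true: Sequence[int], probs: Sequence[float], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted probability, observed event rate, count) per non-empty bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[k].append((p, label))

    points = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(label for _, label in members) / len(members)
            points.append((mean_p, rate, len(members)))
    return points


# Synthetic toy data (not from the study): ten cases predicted at 0.1 with one
# event, and ten cases predicted at 0.9 with nine events.
toy_probs = [0.1] * 10 + [0.9] * 10
toy_labels = [1] + [0] * 9 + [1] * 9 + [0]
curve = calibration_curve_points(toy_labels, toy_probs, n_bins=5)
for mean_p, rate, count in curve:
    print(f"predicted {mean_p:.2f}  observed {rate:.2f}  n={count}")
```

A well-calibrated model yields points close to the diagonal, where the predicted probability equals the observed rate, as in this toy example.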

Conclusion: In this study, the largest of its kind in breast cancer to date, automatically extracted EMR data provided a basis for reliable outcome predictions across multiple statistical methods. Ensemble methods provided substantial advantages over single-model methods. Patient medical history contributed the most to prediction quality.

Keywords: automated data extraction; ensemble methods; machine learning; predictive modeling; radiotherapy outcomes.

MeSH terms

  • Breast Neoplasms / pathology*
  • Breast Neoplasms / radiotherapy*
  • Data Mining / methods*
  • Decision Trees*
  • Electronic Health Records*
  • Female
  • Humans
  • Machine Learning*
  • Middle Aged
  • Predictive Value of Tests
  • Radiotherapy Dosage
  • Treatment Outcome