Development and validation of a random survival forest model for predicting long-term survival of early-stage young breast cancer patients based on the SEER database and an external validation cohort

Am J Cancer Res. 2024 Apr 15;14(4):1609-1621. doi: 10.62347/OJTY4008. eCollection 2024.

Abstract

Young breast cancer (YBC) patients often face a poor prognosis, so a model that can accurately predict the long-term survival of patients with early-stage disease is needed. To this end, we used data from the Surveillance, Epidemiology, and End Results (SEER) database between January 2010 and December 2020 and enrolled an independent external cohort from Tianjin Medical University Cancer Institute and Hospital. The study aimed to develop and validate a prediction model built with the Random Survival Forest (RSF) machine learning algorithm. Using Least Absolute Shrinkage and Selection Operator (LASSO) regression, we identified key prognostic factors for YBC patients and used them to build a model that predicts 3-year, 5-year, 7-year, and 10-year survival rates. The RSF model achieved C-index values of 0.920 in the training set, 0.789 in the internal validation set, and 0.701 in the external validation set, outperforming the Cox regression model. Brier scores at the various time points confirmed the model's calibration and indicated accurate prediction. Decision curve analysis (DCA) supported the model's clinical utility, and Shapley Additive Explanations (SHAP) plots highlighted the contribution of key variables. The RSF model also proved valuable for risk stratification, effectively categorizing patients by survival risk. In summary, this study constructed a well-performing prediction model for evaluating the prognostic factors that influence the long-term survival of early-stage YBC patients, which can support risk stratification when physicians manage YBC patients in clinical settings.
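
For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. It is not the authors' code. The function name `harrell_c_index` and the toy data are hypothetical, and pairs with tied observed times are skipped for simplicity, which established survival libraries may handle differently.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (illustrative sketch).

    times       -- observed follow-up times
    events      -- 1 if the event (death) was observed, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        # a is the patient with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the third one censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.8, 0.1]
toy_c = harrell_c_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by risk, which is the scale on which the training, internal validation, and external validation values should be read.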

Keywords: Young breast cancer; the Surveillance, Epidemiology, and End Results program (SEER); random survival forest; prediction model.