Using a machine learning approach to identify key prognostic molecules for esophageal squamous cell carcinoma

BMC Cancer. 2021 Aug 9;21(1):906. doi: 10.1186/s12885-021-08647-1.

Abstract

Background: Many prognostic biomarkers reported to date for esophageal squamous cell carcinoma (ESCC) show low reproducibility because of the high molecular heterogeneity of ESCC. The purpose of this study was to identify the optimal biomarkers for ESCC using machine learning algorithms.

Methods: Biomarkers related to clinical survival, recurrence or therapeutic response of patients with ESCC were identified by searching literature databases. Forty-eight biomarkers linked to recurrence or prognosis of ESCC were used to construct a molecular interaction network with NetBox, from which functional modules were identified. Publicly available mRNA transcriptome data of ESCC were downloaded from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA), comprising the GSE53625 and TCGA-ESCC datasets. Five machine learning algorithms, namely logistic regression (LR), support vector machine (SVM), artificial neural network (ANN), random forest (RF) and XGBoost, were used to develop prognostic classifiers for feature selection. The area under the ROC curve (AUC) was used to evaluate the performance of the prognostic classifiers. The importance of each identified molecule was ranked by its occurrence frequency in the prognostic classifiers. Kaplan-Meier survival analysis and the log-rank test were used to assess the statistical significance of differences in overall survival.
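
To make the evaluation metric concrete, the following is a minimal sketch of how an AUC can be computed for a binary prognostic classifier. It relies on the standard interpretation of the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function name `roc_auc`, the label coding (1 = event, 0 = no event) and the toy scores are illustrative assumptions and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    labels: iterable of 0/1 class labels (hypothetical coding: 1 = event).
    scores: iterable of classifier scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data, invented purely for illustration.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of samples and is shown only for clarity; a rank-based computation gives the same value more efficiently.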

Results: A total of 48 clinically proven molecules associated with ESCC progression were used to construct a molecular interaction network with 3 functional modules comprising 17 component molecules. For each machine learning algorithm, 131,071 prognostic classifiers were built from these 17 molecules. When the 17 molecules were ranked by their occurrence frequency in the prognostic classifiers with AUCs greater than the mean of all 131,071 AUCs, stratifin, encoded by SFN, was identified as the optimal prognostic biomarker for ESCC. Its performance was further validated in 2 additional independent cohorts.
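
The figure of 131,071 is consistent with building one classifier for every non-empty subset of the 17 molecules, since 2^17 − 1 = 131,071. The sketch below illustrates that counting and a frequency-based ranking of the kind described above: each feature is ranked by how often it appears in subsets whose score exceeds the mean score. The helper names, the four placeholder features and the toy scoring function are hypothetical stand-ins; in the study the score would be the AUC of a trained classifier.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def nonempty_subsets(features):
    """Yield every non-empty subset of `features` as a tuple."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def rank_by_frequency(features, score_fn):
    """Rank features by occurrence in subsets scoring above the mean score."""
    subsets = list(nonempty_subsets(features))
    scores = [score_fn(subset) for subset in subsets]
    threshold = mean(scores)
    counts = Counter({f: 0 for f in features})
    for subset, score in zip(subsets, scores):
        if score > threshold:
            counts.update(subset)
    return counts.most_common()


# Number of non-empty subsets of 17 items.
n_classifiers = 2 ** 17 - 1
print(n_classifiers)

# Toy example with 4 placeholder features (not real molecules).
toy_weight = {"A": 0.30, "B": 0.10, "C": 0.05, "D": 0.00}


def toy_score(subset):
    # Stand-in for an AUC: driven by the most informative feature present.
    return 0.5 + max(toy_weight[f] for f in subset)


ranking = rank_by_frequency(list(toy_weight), toy_score)
print(ranking)
```

In this toy setting only subsets containing feature A score above the mean, so A ranks first. With real data the ranking depends on the trained classifiers and their AUCs.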

Conclusion: The occurrence frequency of a molecule across various feature selection approaches reflects its degree of clinical importance, and stratifin is an optimal prognostic biomarker for ESCC.

Keywords: Artificial neural network; Esophageal squamous cell carcinoma; Logical regression; Machine learning; Random forest; Stratifin; Support vector machine; eXtreme gradient boosting.

MeSH terms

  • Algorithms
  • Biomarkers, Tumor*
  • Computational Biology
  • Esophageal Squamous Cell Carcinoma / diagnosis*
  • Esophageal Squamous Cell Carcinoma / etiology*
  • Gene Expression Profiling
  • Humans
  • Kaplan-Meier Estimate
  • Machine Learning*
  • Prognosis
  • Reproducibility of Results
  • Transcriptome

Substances

  • Biomarkers, Tumor