Multimetric feature selection for analyzing multicategory outcomes of colorectal cancer: random forest and multinomial logistic regression models

Lab Invest. 2022 Mar;102(3):236-244. doi: 10.1038/s41374-021-00662-x. Epub 2021 Sep 18.


Colorectal cancer (CRC) is one of the most common cancers worldwide, and a leading cause of cancer deaths. Better classifying multicategory outcomes of CRC with clinical and omic data may help adjust treatment regimens based on individual's risk. Here, we selected the features that were useful for classifying four-category survival outcome of CRC using the clinical and transcriptomic data, or clinical, transcriptomic, microsatellite instability and selected oncogenic-driver data (all data) of TCGA. We also optimized multimetric feature selection to develop the best multinomial logistic regression (MLR) and random forest (RF) models that had the highest accuracy, precision, recall and F1 score, respectively. We identified 2073 differentially expressed genes of the TCGA RNASeq dataset. MLR overall outperformed RF in the multimetric feature selection. In both RF and MLR models, precision, recall and F1 score increased as the feature number increased and peaked at the feature number of 600-1000, while the models' accuracy remained stable. The best model was the MLR one with 825 features based on sum of squared coefficients using all data, and attained the best accuracy of 0.855, F1 of 0.738 and precision of 0.832, which were higher than those using clinical and transcriptomic data. The top-ranked features in the MLR model of the best performance using clinical and transcriptomic data were different from those using all data. However, pathologic staging, HBS1L, TSPYL4, and TP53TG3B were the overlapping top-20 ranked features in the best models using clinical and transcriptomic, or all data. Thus, we developed a multimetric feature-selection based MLR model that outperformed RF models in classifying four-category outcome of CRC patients. Interestingly, adding microsatellite instability and oncogenic-driver data to clinical and transcriptomic data improved models' performances. Precision and recall of tuned algorithms may change significantly as the feature number changes, but accuracy appears not sensitive to these changes.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adult
  • Aged
  • Colorectal Neoplasms / genetics*
  • Colorectal Neoplasms / pathology
  • Colorectal Neoplasms / therapy
  • Female
  • Gene Expression Profiling / methods*
  • Gene Expression Regulation, Neoplastic*
  • Humans
  • Logistic Models
  • Male
  • Microsatellite Instability
  • Middle Aged
  • Oncogenes / genetics
  • Outcome Assessment, Health Care / classification
  • Outcome Assessment, Health Care / methods*
  • Outcome Assessment, Health Care / statistics & numerical data
  • RNA-Seq / methods
  • Reproducibility of Results