Classify multicategory outcome in patients with lung adenocarcinoma using clinical, transcriptomic and clinico-transcriptomic data: machine learning versus multinomial models

Am J Cancer Res. 2020 Dec 1;10(12):4624-4639. eCollection 2020.


Classification of multicategory survival outcomes is important for precision oncology. Machine learning (ML) algorithms have been used to accurately classify multicategory survival outcomes in some cancer types, but not yet in lung adenocarcinoma. Therefore, we compared the performance of 3 ML models (random forests, support vector machine [SVM], and multilayer perceptron) with that of multinomial logistic regression (Mlogit) models for classifying a 4-category survival outcome in lung adenocarcinoma using data from The Cancer Genome Atlas (TCGA). Overall, the Mlogit model performed similarly to the SVM and multilayer perceptron models (micro-average area under the curve = 0.82), whereas the random forests model was inferior. Surprisingly, transcriptomic data alone and clinico-transcriptomic data each appeared sufficient to accurately classify the 4-category survival outcome in these patients, whereas no model using clinical data alone performed well. Notably, NDUFS5, P2RY2, PRPF18, CCL24, ZNF813, MYL6, FLJ41941, POU5F1B, and SUV420H1 were the top-ranked genes associated with the outcome "alive without disease" and inversely linked to the other outcomes. Similarly, BDKRB2, TERC, DNAJA3, MRPL15, SLC16A13, CRHBP, and ACSBG2 were associated with "alive with progression", and GAL3ST3, AD2, RAB41, HDC, and PLEKHG1 were associated with "dead with disease", while each set was also inversely linked to the other outcomes. These cross-linked genes may be used for risk stratification and future treatment development.
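For readers unfamiliar with the metric, the micro-average area under the curve reported above can be illustrated with a minimal sketch. It assumes the usual definition: each patient's true class is one-hot encoded, the per-class predicted probabilities are pooled across all classes, and a single binary AUC is computed on the pooled pairs. The function names, the integer class codes 0 to 3, and the toy probabilities are hypothetical and are not taken from the study; the authors' actual implementation may differ.

```python
from typing import List, Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Binary AUC from the Mann-Whitney U statistic (ties get average ranks)."""
    n = len(labels)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at sorted position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def micro_average_auc(
    y_true: Sequence[int],
    proba: Sequence[Sequence[float]],
    n_classes: int,
) -> float:
    """Pool one-hot labels and class probabilities, then compute one AUC."""
    pooled_labels: List[int] = []
    pooled_scores: List[float] = []
    for y, row in zip(y_true, proba):
        for c in range(n_classes):
            pooled_labels.append(1 if y == c else 0)
            pooled_scores.append(row[c])
    return binary_auc(pooled_labels, pooled_scores)


if __name__ == "__main__":
    # Hypothetical toy data: 4 patients, 4 outcome categories coded 0..3.
    y_true = [0, 1, 2, 3]
    proba = [
        [0.7, 0.1, 0.1, 0.1],
        [0.2, 0.5, 0.2, 0.1],
        [0.3, 0.3, 0.2, 0.2],  # a poorly ranked patient
        [0.1, 0.1, 0.1, 0.7],
    ]
    print(micro_average_auc(y_true, proba, n_classes=4))
```

Because the pooling weights every patient-class pair equally, the micro-average is dominated by the more frequent outcome categories, which is worth keeping in mind when the categories are imbalanced.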

Keywords: Lung adenocarcinoma; cause-specific mortality; machine learning; multilabel classification; survival; transcriptomic.