Machine learning applications in cancer prognosis and prediction

Comput Struct Biotechnol J. 2014 Nov 15;13:8-17. doi: 10.1016/j.csbj.2014.11.005. eCollection 2015.


Cancer has been characterized as a heterogeneous disease consisting of many different subtypes. The early diagnosis and prognosis of a cancer type have become a necessity in cancer research, as they can facilitate the subsequent clinical management of patients. The importance of classifying cancer patients into high- or low-risk groups has led many research teams, from the biomedical and bioinformatics fields, to study the application of machine learning (ML) methods. These techniques have therefore been used to model the progression and treatment of cancerous conditions. In addition, the ability of ML tools to detect key features in complex datasets underlines their importance. A variety of these techniques, including Artificial Neural Networks (ANNs), Bayesian Networks (BNs), Support Vector Machines (SVMs) and Decision Trees (DTs), have been widely applied in cancer research for the development of predictive models, resulting in effective and accurate decision making. Even though it is evident that the use of ML methods can improve our understanding of cancer progression, an appropriate level of validation is needed before these methods can be considered for everyday clinical practice. In this work, we present a review of recent ML approaches employed in the modeling of cancer progression. The predictive models discussed here are based on various supervised ML techniques as well as on different input features and data samples. Given the growing trend in the application of ML methods in cancer research, we present here the most recent publications that employ these techniques to model cancer risk or patient outcomes.

Keywords: ANN, Artificial Neural Network; AUC, Area Under Curve; BCRSVM, Breast Cancer Support Vector Machine; BN, Bayesian Network; CFS, Correlation based Feature Selection; Cancer recurrence; Cancer survival; Cancer susceptibility; DT, Decision Tree; ES, Early Stopping algorithm; GEO, Gene Expression Omnibus; HTT, High-throughput Technologies; LCS, Learning Classifying Systems; ML, Machine Learning; Machine learning; NCI caArray, National Cancer Institute Array Data Management System; NSCLC, Non-small Cell Lung Cancer; OSCC, Oral Squamous Cell Carcinoma; PPI, Protein–Protein Interaction; Predictive models; ROC, Receiver Operating Characteristic; SEER, Surveillance, Epidemiology and End Results Database; SSL, Semi-supervised Learning; SVM, Support Vector Machine; TCGA, The Cancer Genome Atlas Research Network.

Publication types

  • Review