A comprehensive data level analysis for cancer diagnosis on imbalanced data

J Biomed Inform. 2019 Feb:90:103089. doi: 10.1016/j.jbi.2018.12.003. Epub 2019 Jan 3.

Abstract

Early diagnosis of cancer, one of the major causes of death, is vital for cancer patients. Diagnosing diseases in general, and cancer in particular, is an important application of data analysis in medical science. However, imbalanced data distribution and imbalanced quality of the majority and minority classes, which lead to misclassification, are a great challenge in this field. Although the majority class samples and their correct classification carry more weight for the classifier, cancer is diagnosed by relying on the minority class samples (the cancer data class). While the consequence of a wrong diagnosis for non-cancer patients is several additional clinical tests, cancer patients pay the price of a wrong diagnosis with their lives. As such, studying the class imbalance problem is vital from a medical perspective. To serve this purpose, a comprehensive study of the consequences of the imbalanced data problem is performed in this paper on cancer patient data for the first time. In this context, the two main families of balancing techniques, oversampling and undersampling, are employed, comprising 18 algorithms in total. The oversampling techniques are ADASYN, ADOMS, AHC, Borderline-SMOTE, ROS, Safe-Level-SMOTE, SMOTE, SMOTE-ENN, SMOTE-TL, SPIDER and SPIDER2, while the undersampling techniques are CNN, CNNTL, NCL, OSS, RUS, SBC and TL. To examine the impact of the balancers on classifier performance, four classifiers, RIPPER, MLP, KNN, and C4.5, are employed as learners. In addition, 15 cancer data sets from the SEER program are used for the study: kidney, soft tissue, bladder, rectum, colon, bone, larynx, breast, cervix, prostate, oropharynx, melanoma, thyroid, testis, and lip.
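As a rough illustration of the two simplest balancers named above, the following sketch shows random oversampling (ROS) and random undersampling (RUS) on a toy data set. It is a minimal sketch under stated assumptions, not the implementation used in the study: the function names `random_oversample` and `random_undersample`, the fixed seed, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """ROS sketch: duplicate random samples of smaller classes until
    every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))  # sampled with replacement
            out_y.append(cls)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """RUS sketch: keep a random subset of each class so that
    every class is as small as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_x, out_y = [], []
    for cls in counts:
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for x in rng.sample(pool, target):  # sampled without replacement
            out_x.append(x)
            out_y.append(cls)
    return out_x, out_y


# Toy data (made up for illustration): 8 majority samples, 2 minority samples.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

ros_X, ros_y = random_oversample(X, y)
rus_X, rus_y = random_undersample(X, y)
```

ROS keeps every original sample and grows the minority class, whereas RUS discards majority samples; the other listed techniques (for example SMOTE or Tomek links) refine these ideas by synthesizing or selectively removing samples.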
The findings of the study center on examining the impact of class imbalance on the performance of classifiers, a general comparison of the performance of the pre-processing techniques and classifiers across all data sets, and finally determining the best balancer and classifier for each kind of cancer data set. According to the results, significant improvement is obtained by using balancers. Assessed by AUC, the performance of the classifiers on the imbalanced cancer data sets improved in 90% of the cases after balancing techniques were applied. To be more precise, Friedman statistical tests are applied and, interestingly, each kind of cancer data set responded differently to different balancing techniques and classifiers. Moreover, considering the mean rank of each technique and classifier across the data sets, oversampling techniques yield better outcomes than undersampling ones.
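The mean-rank comparison mentioned above can be sketched as follows. This is a minimal illustration of the standard Friedman statistic computed from mean ranks (without tie correction), not the authors' analysis code; the helper names and the AUC values in `auc_table` are made up purely for illustration and are not results from the paper.

```python
def rank_row(scores):
    """Rank one row of scores: 1 = best (highest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mean_ranks(table):
    """Mean rank of each method (column) over all data sets (rows)."""
    ranked = [rank_row(row) for row in table]
    return [sum(col) / len(ranked) for col in zip(*ranked)]


def friedman_statistic(table):
    """Friedman chi-square statistic (no tie correction) for N rows, k columns."""
    n, k = len(table), len(table[0])
    r = mean_ranks(table)
    return 12 * n / (k * (k + 1)) * (sum(x * x for x in r) - k * (k + 1) ** 2 / 4)


# Hypothetical AUC values: rows are data sets, columns are methods A, B, C.
auc_table = [
    [0.70, 0.80, 0.75],
    [0.60, 0.90, 0.65],
    [0.72, 0.85, 0.80],
]

ranks = mean_ranks(auc_table)
stat = friedman_statistic(auc_table)
```

A lower mean rank indicates a method that tends to score higher across data sets; the statistic is then compared against a chi-square distribution with k - 1 degrees of freedom to judge whether the differences in rank are significant.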

Keywords: Classification; Data pre-processing; Diagnosis of cancer; Imbalanced data.

Publication types

  • Review

MeSH terms

  • Algorithms*
  • Data Analysis*
  • Humans
  • Medical Informatics*
  • Neoplasms / diagnosis*