Computer-aided diagnosis of lung cancer: the effect of training data sets on classification accuracy of lung nodules

Phys Med Biol. 2018 Feb 5;63(3):035036. doi: 10.1088/1361-6560/aaa610.

Abstract

This study aims to develop a computer-aided diagnosis (CADx) scheme for classification between malignant and benign lung nodules, and to assess whether CADx performance changes in detecting nodules associated with early and advanced stage lung cancer. The study involves 243 biopsy-confirmed pulmonary nodules. Among them, 76 are benign, 81 are stage I and 86 are stage III malignant nodules. The cases are separated into three data sets involving: (1) all nodules, (2) benign and stage I malignant nodules, and (3) benign and stage III malignant nodules. A CADx scheme is applied to segment lung nodules depicted on computed tomography images, and 66 3D image features are initially computed. Then, three machine learning models, namely a support vector machine, a naïve Bayes classifier and linear discriminant analysis, are separately trained and tested using the three data sets and a leave-one-case-out cross-validation method embedded with a Relief-F feature selection algorithm. When the three data sets are used separately to train and test the three classifiers, the average areas under the receiver operating characteristic curve (AUC) are 0.94, 0.90 and 0.99, respectively. When the classifiers trained on the data set with all nodules are used, average AUC values are 0.88 and 0.99 for detecting early and advanced stage nodules, respectively. AUC values computed from the three classifiers trained using the same data set are consistent, with no statistically significant difference (p > 0.05). This study demonstrates (1) the feasibility of applying a CADx scheme to accurately distinguish between benign and malignant lung nodules, and (2) a positive trend between CADx performance and cancer progression stage. Thus, in order to increase CADx performance in detecting subtle and early cancer, training data sets should include more diverse early stage cancer cases.
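The evaluation protocol described above (leave-one-case-out cross-validation with feature selection embedded in each fold, scored by AUC) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, a simple standardized mean-difference ranking stands in for Relief-F, and a nearest-centroid score stands in for the SVM, naïve Bayes and LDA classifiers. All function names (`auc`, `rank_features`, `centroid_score`, `loocv_auc`, `make_synthetic`) are hypothetical. The point it shows is that features are re-selected from the training cases only in every fold, so the held-out case never influences selection.

```python
import random
from statistics import mean, pstdev


def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_features(X, y, k):
    """Assumed stand-in for Relief-F: keep the k features with the largest
    class-mean difference relative to the within-class spread."""
    n_feat = len(X[0])
    merit = []
    for j in range(n_feat):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        spread = pstdev(pos) + pstdev(neg) + 1e-12  # avoid division by zero
        merit.append(abs(mean(pos) - mean(neg)) / spread)
    return sorted(range(n_feat), key=lambda j: merit[j], reverse=True)[:k]


def centroid_score(X_train, y_train, x, feats):
    """Assumed stand-in classifier: squared distance to the benign centroid
    minus squared distance to the malignant centroid (higher = more malignant-like)."""
    def centroid(lab):
        rows = [r for r, l in zip(X_train, y_train) if l == lab]
        return [mean(r[j] for r in rows) for j in feats]

    def dist2(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(feats, c))

    return dist2(centroid(0)) - dist2(centroid(1))


def loocv_auc(X, y, k=5):
    """Leave-one-case-out cross-validation with feature selection inside each fold."""
    scores = []
    for i in range(len(X)):
        X_tr = X[:i] + X[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        feats = rank_features(X_tr, y_tr, k)  # selection sees training cases only
        scores.append(centroid_score(X_tr, y_tr, X[i], feats))
    return auc(scores, y)


def make_synthetic(n_per_class=30, n_feat=20, n_informative=3, seed=0):
    """Synthetic two-class data; NOT the study's nodule features."""
    rng = random.Random(seed)
    X, y = [], []
    for lab in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_feat)]
            for j in range(n_informative):
                row[j] += 1.5 * lab  # shift informative features for class 1
            X.append(row)
            y.append(lab)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    print("LOOCV AUC on synthetic data:", round(loocv_auc(X, y), 3))
```

Selecting features once on the full data set and then cross-validating would let the held-out case leak into the selection step and tend to inflate the AUC, which is presumably why the selection is embedded in the cross-validation loop.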

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Bayes Theorem
  • Carcinoma, Non-Small-Cell Lung / diagnosis*
  • Carcinoma, Non-Small-Cell Lung / diagnostic imaging
  • Case-Control Studies
  • Diagnosis, Computer-Assisted / methods*
  • Female
  • Humans
  • Imaging, Three-Dimensional
  • Lung Neoplasms / diagnosis*
  • Lung Neoplasms / diagnostic imaging
  • Machine Learning
  • Male
  • Multiple Pulmonary Nodules / classification*
  • Multiple Pulmonary Nodules / diagnosis*
  • Multiple Pulmonary Nodules / diagnostic imaging
  • Neoplasm Staging
  • ROC Curve
  • Retrospective Studies
  • Support Vector Machine
  • Tomography, X-Ray Computed / methods*