Comparing Three Data Mining Algorithms for Identifying the Associated Risk Factors of Type 2 Diabetes

Iran Biomed J. 2018 Sep;22(5):303-11. doi: 10.29252/ibj.22.5.303. Epub 2018 Jan 27.


Background: The increasing prevalence of type 2 diabetes has created a global health burden and is a concern among health service providers and health administrators. The current study aimed to develop and compare statistical models for identifying the risk factors associated with type 2 diabetes. To this end, artificial neural network (ANN), support vector machine (SVM), and multiple logistic regression (MLR) models were applied to the demographic, anthropometric, and biochemical characteristics of a sample of 9528 individuals from Mashhad City, Iran.

Methods: The study randomly selected 6654 (70%) cases for training and reserved the remaining 2874 (30%) cases for testing. The three methods were compared using receiver operating characteristic (ROC) curves.
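The split and the ROC-based comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the helper names `split_train_test` and `roc_auc`, the fixed seed, and the use of the rank (Mann-Whitney) formulation of the area under the ROC curve are all choices made here for the example. The training size is passed explicitly because the reported 6654 cases are approximately, not exactly, 70% of 9528.

```python
import random


def split_train_test(n_cases, n_train, seed=0):
    """Randomly partition case indices 0..n_cases-1 into train and test lists.

    The seed is a hypothetical choice so the split is reproducible.
    """
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    This equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    with ties counted as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Sample sizes taken from the abstract: 9528 cases, 6654 used for training.
train_idx, test_idx = split_train_test(9528, 6654)
print(len(train_idx), len(test_idx))

# Toy labels and scores (made up for illustration, not study data).
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

In practice each fitted model would produce a score per test case, and `roc_auc` would be evaluated once per model on the same held-out cases so that the areas are directly comparable.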

Results: The prevalence of type 2 diabetes was 14% in our population. The ANN model had 78.7% accuracy, 63.1% sensitivity, and 81.2% specificity. The corresponding values were 76.8%, 64.5%, and 78.9% for SVM and 77.7%, 60.1%, and 80.5% for MLR. The area under the ROC curve was 0.71 for ANN, 0.73 for SVM, and 0.70 for MLR.
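Accuracy, sensitivity, and specificity are linked through the prevalence: accuracy = sensitivity × prevalence + specificity × (1 − prevalence). The sketch below applies that identity to the reported figures as a rough consistency check. It assumes the 14% population prevalence also holds in the test set, which the abstract does not state; the function name `expected_accuracy` and the `reported` dictionary are illustrative only.

```python
def expected_accuracy(sensitivity, specificity, prevalence):
    """Accuracy implied by sensitivity, specificity, and prevalence."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


# Assumption: test-set prevalence equals the reported population value.
PREVALENCE = 0.14

# (accuracy, sensitivity, specificity) as reported in the abstract.
reported = {
    "ANN": (0.787, 0.631, 0.812),
    "SVM": (0.768, 0.645, 0.789),
    "MLR": (0.777, 0.601, 0.805),
}

implied = {}
for name, (accuracy, sensitivity, specificity) in reported.items():
    implied[name] = expected_accuracy(sensitivity, specificity, PREVALENCE)
    print(f"{name}: reported {accuracy:.3f}, implied {implied[name]:.3f}")
```

Under that assumption, each implied accuracy lands within about 0.1 percentage point of the reported value, so the three metrics are mutually consistent to rounding.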

Conclusion: Our findings showed that ANN performs better than the other two models (SVM and MLR) and can be used effectively to identify the associated risk factors of type 2 diabetes.

Keywords: Support vector machine; Data mining; Diabetes type 2.

Publication types

  • Comparative Study

MeSH terms

  • Adult
  • Algorithms*
  • Data Mining / methods*
  • Data Mining / standards
  • Diabetes Mellitus, Type 2 / diagnosis*
  • Diabetes Mellitus, Type 2 / epidemiology*
  • Female
  • Humans
  • Iran / epidemiology
  • Male
  • Middle Aged
  • Neural Networks, Computer*
  • Risk Factors
  • Support Vector Machine* / standards