Use of Automated Machine Learning to Detect Undiagnosed Diabetes in US Adults: Development and Validation Study

JMIR AI. 2025 Oct 8;4:e68260. doi: 10.2196/68260.

Abstract

Background: Early diagnosis of diabetes is essential for early interventions to slow the progression of dysglycemia and its comorbidities. However, among individuals with diabetes, about 23% were unaware of their condition.

Objective: This study aims to investigate the potential use of automated machine learning (AutoML) models and self-reported data in detecting undiagnosed diabetes among US adults.

Methods: Individual-level data, including biochemical tests for diabetes, demographic characteristics, family history of diabetes, anthropometric measures, dietary intakes, health behaviors, and chronic conditions, were retrieved from the National Health and Nutrition Examination Survey, 1999-2020. Undiagnosed diabetes was defined as having no prior self-reported diagnosis but meeting diagnostic criteria based on hemoglobin A1c, fasting plasma glucose, or 2-hour plasma glucose. The H2O AutoML framework, which supports automated hyperparameter tuning, model selection, and ensemble learning, was used to automate the machine learning workflow. For comparison, 4 traditional machine learning models (logistic regression, support vector machines, random forest, and extreme gradient boosting) were implemented. Model performance was evaluated using the area under the receiver operating characteristic curve.
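
The case definition above can be sketched as a small function. This is a minimal illustration, not the authors' code. The abstract does not state the cut points, so the thresholds below assume the standard American Diabetes Association criteria (hemoglobin A1c of at least 6.5%, fasting plasma glucose of at least 126 mg/dL, 2-hour plasma glucose of at least 200 mg/dL). The function and parameter names are hypothetical.

```python
from typing import Optional

# Assumed ADA diagnostic cut points; the abstract itself does not list them.
HBA1C_PCT = 6.5          # hemoglobin A1c, percent
FPG_MG_DL = 126.0        # fasting plasma glucose, mg/dL
OGTT_2H_MG_DL = 200.0    # 2-hour plasma glucose, mg/dL


def is_undiagnosed_diabetes(
    self_reported_dx: bool,
    hba1c: Optional[float] = None,
    fpg: Optional[float] = None,
    ogtt_2h: Optional[float] = None,
) -> bool:
    """Return True when there is no prior self-reported diagnosis and
    at least one available test meets its (assumed) threshold."""
    if self_reported_dx:
        # A prior diagnosis means the case is diagnosed, not undiagnosed.
        return False
    checks = [
        (hba1c, HBA1C_PCT),
        (fpg, FPG_MG_DL),
        (ogtt_2h, OGTT_2H_MG_DL),
    ]
    # Missing tests (None) are skipped rather than treated as normal.
    return any(value is not None and value >= cutoff for value, cutoff in checks)
```

Missing measurements are skipped here, so a participant with only one available test is classified on that test alone. How the study handled missing biochemical values is not stated in the abstract.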

Results: The study included 11,815 participants aged 20 years and older, comprising 2256 participants with undiagnosed diabetes and 9559 without diabetes. The average age was 59.76 (SD 15.0) years for participants with undiagnosed diabetes and 46.78 (SD 17.2) years for those without diabetes. The AutoML model outperformed the 4 traditional machine learning models. The trained AutoML model achieved an area under the receiver operating characteristic curve of 0.909 (95% CI 0.897-0.921) in the test set. The model demonstrated a sensitivity of 70.26%, specificity of 90.46%, positive predictive value of 64.10%, and negative predictive value of 92.61% for distinguishing undiagnosed diabetes from no diabetes.
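
The relationship among the 4 reported test-set metrics can be illustrated with Bayes' rule: the predictive values follow from sensitivity, specificity, and prevalence. The sketch below is not the authors' evaluation code. It assumes the prevalence of the full cohort (2256 of 11,815), because the abstract does not report the composition of the test set. The function name is hypothetical.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    # Expected fraction of the population in each confusion-matrix cell.
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)

    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumption: whole-cohort prevalence stands in for the unreported test-set prevalence.
cohort_prevalence = 2256 / 11815
ppv, npv = predictive_values(0.7026, 0.9046, cohort_prevalence)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Under this assumption the implied values come out near 0.63 and 0.93, close to but not identical to the reported 64.10% and 92.61%. The small gap is consistent with the test split having a slightly different prevalence than the full cohort. The same function shows how the positive predictive value would fall if the model were applied to a population with a lower prevalence of undiagnosed diabetes.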

Conclusions: To our knowledge, this study is the first to use an AutoML model for detecting undiagnosed diabetes in US adults. The model's strong performance and applicability to the broader US population make it a promising tool for large-scale diabetes screening efforts.

Keywords: AutoML; machine learning; screening; self-report; undiagnosed diabetes.