Development and Validation of a Machine Learning Model Using Administrative Health Data to Predict Onset of Type 2 Diabetes

JAMA Netw Open. 2021 May 3;4(5):e2111315. doi: 10.1001/jamanetworkopen.2021.11315.

Abstract

Importance: Systems-level barriers to diabetes care could be improved with population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions.

Objective: To develop and validate a population-level machine learning model for predicting type 2 diabetes 5 years before diabetes onset using administrative health data.

Design, setting, and participants: This decision analytical model study used linked administrative health data from the diverse, single-payer health system in Ontario, Canada, between January 1, 2006, and December 31, 2016. A gradient boosting decision tree model was trained on data from 1 657 395 patients, validated on 243 442 patients, and tested on 236 506 patients. Costs associated with each patient were estimated using a validated costing algorithm. Data were analyzed from January 1, 2006, to December 31, 2016.

Exposures: A random sample of 2 137 343 residents of Ontario without type 2 diabetes was obtained at study start time. More than 300 features from data sets capturing demographic information, laboratory measurements, drug benefits, health care system interactions, social determinants of health, and ambulatory care and hospitalization records were compiled over 2-year patient medical histories to generate quarterly predictions.

Main outcomes and measures: Discrimination was assessed using the area under the receiver operating characteristic curve statistic, and calibration was assessed visually using calibration plots. Feature contribution was assessed with Shapley values. Costs were estimated in 2020 US dollars.

Results: This study trained a gradient boosting decision tree model on data from 1 657 395 patients (12 900 257 instances; 6 666 662 women [51.7%]). The developed model achieved a test area under the curve of 80.26 (range, 80.21-80.29), demonstrated good calibration, and was robust to sex, immigration status, area-level marginalization with regard to material deprivation and race/ethnicity, and low contact with the health care system. The top 5% of patients predicted as high risk by the model represented 26% of the total annual diabetes cost in Ontario.

Conclusions and relevance: In this decision analytical model study, a machine learning model approach accurately predicted the incidence of diabetes in the population using routinely collected health administrative data. These results suggest that the model could be used to inform decision-making for population health planning and diabetes prevention.

Publication types

  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Adolescent
  • Adult
  • Age of Onset*
  • Aged
  • Aged, 80 and over
  • Algorithms*
  • Child
  • Cohort Studies
  • Decision Making, Computer-Assisted*
  • Diabetes Mellitus, Type 2 / diagnosis*
  • Diabetes Mellitus, Type 2 / epidemiology
  • Diabetes Mellitus, Type 2 / physiopathology*
  • Electronic Health Records / statistics & numerical data
  • Female
  • Forecasting / methods*
  • Humans
  • Incidence
  • Machine Learning*
  • Male
  • Middle Aged
  • Ontario / epidemiology
  • Retrospective Studies
  • Risk Assessment / methods*
  • Young Adult