Development and Validation of a Machine Learning Model Using Administrative Health Data to Predict Onset of Type 2 Diabetes

Mathieu Ravaut; Vinyas Harish; Hamed Sadeghi; Kin Kwan Leung; Maksims Volkovs; Kathy Kornas; Tristan Watson; Tomi Poutanen; Laura C Rosella

doi:10.1001/jamanetworkopen.2021.11315

JAMA Netw Open. 2021 May 3;4(5):e2111315. doi: 10.1001/jamanetworkopen.2021.11315.

Authors

Mathieu Ravaut^{1,2}, Vinyas Harish^{3,4,5,6}, Hamed Sadeghi¹, Kin Kwan Leung¹, Maksims Volkovs¹, Kathy Kornas³, Tristan Watson^{3,7}, Tomi Poutanen¹, Laura C Rosella^{3,4,5,6,7,8}

Affiliations

¹ Layer 6 AI, Toronto, Ontario, Canada.
² Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.
³ Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada.
⁴ Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.
⁵ Temerty Centre for Artificial Intelligence Research and Education in Medicine, University of Toronto, Toronto, Ontario, Canada.
⁶ Vector Institute, Toronto, Ontario, Canada.
⁷ Institute for Clinical Evaluative Sciences (ICES), Toronto, Ontario, Canada.
⁸ Institute for Better Health, Trillium Health Partners, Mississauga, Ontario, Canada.

Abstract

Importance: Systems-level barriers to diabetes care could be improved with population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions.

Objective: To develop and validate a population-level machine learning model for predicting type 2 diabetes 5 years before diabetes onset using administrative health data.

Design, setting, and participants: This decision analytical model study used linked administrative health data from the diverse, single-payer health system in Ontario, Canada, between January 1, 2006, and December 31, 2016. A gradient boosting decision tree model was trained on data from 1 657 395 patients, validated on 243 442 patients, and tested on 236 506 patients. Costs associated with each patient were estimated using a validated costing algorithm. Data were analyzed from January 1, 2006, to December 31, 2016.

Exposures: A random sample of 2 137 343 residents of Ontario without type 2 diabetes was obtained at study start time. More than 300 features from data sets capturing demographic information, laboratory measurements, drug benefits, health care system interactions, social determinants of health, and ambulatory care and hospitalization records were compiled over 2-year patient medical histories to generate quarterly predictions.
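As a quick arithmetic check, the three partitions reported under Design add up to the random sample described here. The short Python sketch below verifies that and prints each partition's share; the function name split_shares is illustrative and does not come from the study.

```python
# Cohort sizes as reported in the abstract.
TRAIN, VALIDATION, TEST = 1_657_395, 243_442, 236_506
SAMPLE = 2_137_343


def split_shares(parts, total):
    """Return each partition's share of the total, checking that they add up."""
    if sum(parts) != total:
        raise ValueError("partitions do not sum to the reported total")
    return [part / total for part in parts]


shares = split_shares([TRAIN, VALIDATION, TEST], SAMPLE)
for name, share in zip(["train", "validation", "test"], shares):
    print(f"{name}: {share:.1%}")
```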

Main outcomes and measures: Discrimination was assessed using the area under the receiver operating characteristic curve statistic, and calibration was assessed visually using calibration plots. Feature contribution was assessed with Shapley values. Costs were estimated in 2020 US dollars.
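For readers unfamiliar with these measures, the following is a minimal pure-Python sketch of how discrimination and calibration can be computed from predicted risks and observed outcomes. It is not the study's code: the function names roc_auc and calibration_bins, the equal-width binning, and the toy inputs are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random positive outscores a random negative.

    Uses the rank-sum (Mann-Whitney) identity; tied scores get average ranks,
    so a positive/negative tie counts as half a win.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def calibration_bins(
    labels: Sequence[int], scores: Sequence[float], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Per non-empty bin: (mean predicted risk, observed event rate, count)."""
    acc = [[0.0, 0, 0] for _ in range(n_bins)]
    for y, p in zip(labels, scores):
        b = min(int(p * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        acc[b][0] += p
        acc[b][1] += y
        acc[b][2] += 1
    return [(s / c, e / c, c) for s, e, c in acc if c]


# Toy data only; these are not values from the study.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
print(calibration_bins(toy_labels, toy_scores, n_bins=2))
```

Plotting the first two elements of each bin against each other gives the kind of calibration plot the abstract refers to; points near the diagonal indicate predicted risks that match observed rates.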

Results: This study trained a gradient boosting decision tree model on data from 1 657 395 patients (12 900 257 instances; 6 666 662 women [51.7%]). The developed model achieved a test area under the curve of 80.26% (range, 80.21%-80.29%), demonstrated good calibration, and was robust to sex, immigration status, area-level marginalization with regard to material deprivation and race/ethnicity, and low contact with the health care system. The top 5% of patients predicted as high risk by the model represented 26% of the total annual diabetes cost in Ontario.
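The cost-concentration statistic in the last sentence can be illustrated with a short sketch: rank patients by predicted risk, take the top fraction, and divide their summed cost by the total. The function name top_fraction_cost_share and the toy numbers are hypothetical, and the study's own costing algorithm is not reproduced here.

```python
import math
from typing import Sequence


def top_fraction_cost_share(
    risks: Sequence[float], costs: Sequence[float], fraction: float = 0.05
) -> float:
    """Share of total cost attributable to the highest-risk fraction of patients."""
    if len(risks) != len(costs) or not risks:
        raise ValueError("risks and costs must be non-empty and the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total = sum(costs)
    if total <= 0:
        raise ValueError("total cost must be positive")
    # Patient indices ordered from highest to lowest predicted risk.
    order = sorted(range(len(risks)), key=lambda i: risks[i], reverse=True)
    k = max(1, math.ceil(fraction * len(order)))
    return sum(costs[i] for i in order[:k]) / total


# Toy data only: four patients, top 25% by risk is the single riskiest patient.
print(top_fraction_cost_share([0.9, 0.1, 0.2, 0.3], [50, 10, 20, 20], fraction=0.25))
```

A share well above the chosen fraction, as in the reported 26% of cost for the top 5% of patients, indicates that the model's risk ranking concentrates cost in a small group.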

Conclusions and relevance: In this decision analytical model study, a machine learning model approach accurately predicted the incidence of diabetes in the population using routinely collected health administrative data. These results suggest that the model could be used to inform decision-making for population health planning and diabetes prevention.

Publication types

Research Support, Non-U.S. Gov't
Validation Study

MeSH terms

Adolescent
Adult
Age of Onset*
Aged
Aged, 80 and over
Algorithms*
Child
Cohort Studies
Decision Making, Computer-Assisted*
Diabetes Mellitus, Type 2 / diagnosis*
Diabetes Mellitus, Type 2 / epidemiology
Diabetes Mellitus, Type 2 / physiopathology*
Electronic Health Records / statistics & numerical data
Female
Forecasting / methods*
Humans
Incidence
Machine Learning*
Male
Middle Aged
Ontario / epidemiology
Retrospective Studies
Risk Assessment / methods*
Young Adult