Population-Level Prediction of Type 2 Diabetes From Claims Data and Analysis of Risk Factors

Narges Razavian; Saul Blecker; Ann Marie Schmidt; Aaron Smith-McLallen; Somesh Nigam; David Sontag

doi:10.1089/big.2015.0020

Big Data. 2015 Dec;3(4):277-87. doi: 10.1089/big.2015.0020.

Authors

Narges Razavian¹, Saul Blecker², Ann Marie Schmidt³, Aaron Smith-McLallen⁴, Somesh Nigam⁴, David Sontag¹

Affiliations

¹ Department of Computer Science, New York University, New York, New York.
² Department of Population Health, NYU Langone Medical Center, New York University, New York, New York.
³ Department of Medicine, Department of Biochemistry and Molecular Pharmacology, Department of Pathology Medicine, and Diabetes Research Program, NYU Langone Medical Center, New York University, New York, New York.
⁴ Advanced Analytics, Independence Blue Cross, Philadelphia, Pennsylvania.

PMID: 27441408

Abstract

We present a new approach to population health, in which data-driven predictive models are learned for outcomes such as type 2 diabetes. Our approach enables risk assessment from readily available electronic claims data on large populations, without additional screening cost. The proposed model uncovers early and late-stage risk factors. Using administrative claims, pharmacy records, healthcare utilization, and laboratory results of 4.1 million individuals between 2005 and 2009, an initial set of 42,000 variables was derived that together describe the full health status and history of every individual. Machine learning was then used to methodically enhance the predictive variable set and fit models predicting onset of type 2 diabetes in 2009-2011, 2010-2012, and 2011-2013. We compared the enhanced model with a parsimonious model consisting of known diabetes risk factors in a real-world environment, where missing values are common. Furthermore, we analyzed novel and known risk factors emerging from the model in different age groups and at different stages before onset. The parsimonious model, using 21 classic diabetes risk factors, resulted in an area under the ROC curve (AUC) of 0.75 for diabetes prediction within a 2-year window following the baseline. The enhanced model increased the AUC to 0.80, with about 900 variables selected as predictive (p < 0.0001 for differences between AUCs). Similar improvements were observed for models predicting diabetes onset 1-3 years and 2-4 years after baseline. The enhanced model improved positive predictive value by at least 50% and identified novel surrogate risk factors for type 2 diabetes, such as chronic liver disease (odds ratio [OR] 3.71), high alanine aminotransferase (OR 2.26), esophageal reflux (OR 1.85), and history of acute bronchitis (OR 1.45). Liver risk factors emerge later in the process of diabetes development compared with obesity-related factors such as hypertension and high hemoglobin A1c.
In conclusion, population-level risk prediction for type 2 diabetes using readily available administrative data is feasible and has better prediction performance than classical diabetes risk prediction algorithms on very large populations with missing data. The new model enables intervention allocation at national scale quickly and accurately and recovers potentially novel risk factors at different stages before the disease onset.
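
The three quantities reported above (AUC, positive predictive value, and odds ratio) can be illustrated with a minimal Python sketch. It is not the authors' code. The function names are hypothetical, the scores and labels are made-up toy values rather than study data, and reading an odds ratio as the exponential of a logistic-regression coefficient is an assumption, since the abstract does not say how the reported ORs were computed.

```python
import math


def auc(scores, labels):
    """AUC as the probability that a randomly chosen case (label 1)
    is scored above a randomly chosen non-case (label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ppv_at_k(scores, labels, k):
    """Positive predictive value among the k highest-scored individuals."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


def odds_ratio_from_coefficient(beta):
    """Assumption: for a binary feature in a logistic regression,
    the odds ratio is exp(beta)."""
    return math.exp(beta)


# Toy data (hypothetical): 3 individuals who develop diabetes, 5 who do not.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
baseline_scores = [0.9, 0.4, 0.3, 0.8, 0.5, 0.2, 0.1, 0.35]
enhanced_scores = [0.9, 0.7, 0.3, 0.8, 0.5, 0.2, 0.1, 0.35]

print("baseline AUC:", auc(baseline_scores, labels))
print("enhanced AUC:", auc(enhanced_scores, labels))
print("baseline PPV@3:", ppv_at_k(baseline_scores, labels, 3))
print("enhanced PPV@3:", ppv_at_k(enhanced_scores, labels, 3))

# Under the stated assumption, a coefficient of ln(3.71) corresponds
# to the OR of 3.71 reported for chronic liver disease.
print("OR:", odds_ratio_from_coefficient(math.log(3.71)))
```

The pairwise AUC above is quadratic in the number of individuals and is meant only to make the definition concrete; at the scale of millions of individuals a rank-based computation would be used instead.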

Keywords: big data analytics; data mining; disease prediction; longitudinal study; machine learning; predictive analytics; risk assessment.