Data mining a diabetic data warehouse

Artif Intell Med. Sep-Oct 2002;26(1-2):37-54. doi: 10.1016/s0933-3657(02)00051-9.


Diabetes is a major health problem in the United States. There is a long history of diabetic registries and databases with systematically collected patient information. We examine one such diabetic data warehouse and describe a method for applying data mining techniques, along with some of the data issues, analysis problems, and results. The diabetic data warehouse is from a large integrated health care system in the New Orleans area with 30,383 diabetic patients. Translating a complex relational database with time series and sequencing information into a flat file suitable for data mining is challenging. We discuss two variables in detail: a comorbidity index and HgbA1c, a measure of glycemic control related to outcomes. We used the classification tree approach in Classification and Regression Trees (CART) with a binary target variable of HgbA1c >9.5 and 10 predictors: age, sex, emergency department visits, office visits, comorbidity index, dyslipidemia, hypertension, cardiovascular disease, retinopathy, and end-stage renal disease. Unexpectedly, the most important variable associated with poor glycemic control is younger age, not the comorbidity index or whether patients have related diseases. If we want to target diabetics with poor HgbA1c values, the odds of finding them are 3.2 times as high among those <65 years of age as among those older. Data mining can discover novel associations that are useful to clinicians and administrators [corrected].
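
As an illustration of the flattening step and the odds comparison described above, the following is a minimal sketch and not the authors' procedure. All records, ages, and names (`LAB_ROWS`, `AGES`, `flatten`, `odds_ratio`) are hypothetical, and using each patient's most recent HgbA1c value is an assumption; the abstract does not say how repeated measurements were summarized. Only the >9.5 threshold and the <65 age split come from the text.

```python
from datetime import date

# Hypothetical longitudinal lab rows: (patient_id, test_date, HgbA1c value).
# These values are invented for illustration only.
LAB_ROWS = [
    (1, date(2000, 1, 10), 10.2),
    (1, date(2000, 6, 2), 9.8),
    (2, date(2000, 3, 5), 7.1),
    (3, date(2000, 2, 1), 8.0),
    (3, date(2000, 9, 9), 10.1),
    (4, date(2000, 4, 4), 6.5),
    (5, date(2000, 5, 5), 11.0),
    (6, date(2000, 7, 7), 7.4),
]

# Hypothetical patient ages in years.
AGES = {1: 52, 2: 71, 3: 60, 4: 80, 5: 68, 6: 45}


def flatten(lab_rows, ages, threshold=9.5):
    """Collapse repeated lab rows into one flat record per patient.

    Assumption: the most recent HgbA1c value represents the patient.
    """
    latest = {}
    for patient_id, test_date, value in lab_rows:
        if patient_id not in latest or test_date > latest[patient_id][0]:
            latest[patient_id] = (test_date, value)
    return [
        {
            "patient_id": patient_id,
            "under_65": ages[patient_id] < 65,
            "poor_control": value > threshold,  # binary target
        }
        for patient_id, (_, value) in sorted(latest.items())
    ]


def odds_ratio(rows):
    """Odds of poor control in patients <65 relative to patients 65 or older."""
    a = sum(r["under_65"] and r["poor_control"] for r in rows)
    b = sum(r["under_65"] and not r["poor_control"] for r in rows)
    c = sum(not r["under_65"] and r["poor_control"] for r in rows)
    d = sum(not r["under_65"] and not r["poor_control"] for r in rows)
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined: a 2x2 cell needed in the denominator is zero")
    return (a * d) / (b * c)


flat_rows = flatten(LAB_ROWS, AGES)
print(odds_ratio(flat_rows))
```

The flat records produced by `flatten` have the one-row-per-patient shape that a classification tree procedure expects; the remaining predictors named in the abstract would be added as further columns in the same way.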

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adult
  • Age Factors
  • Aged
  • Comorbidity
  • Databases, Factual
  • Diabetes Mellitus*
  • Female
  • Humans
  • Hyperglycemia / etiology
  • Hyperglycemia / therapy
  • Hypoglycemia / etiology
  • Hypoglycemia / therapy
  • Information Storage and Retrieval*
  • Male
  • Middle Aged
  • Registries / statistics & numerical data*
  • Software