A bootstrapping algorithm to improve cohort identification using structured data

J Biomed Inform. 2011 Dec;44 Suppl 1:S63-S68. doi: 10.1016/j.jbi.2011.10.013. Epub 2011 Nov 7.


Cohort identification is an important step in conducting clinical research studies. Using ICD-9 codes to identify disease cohorts is a common approach that can yield satisfactory results for certain conditions; however, many use cases require more accurate methods. In this study, we propose a bootstrapping method that supplements ICD-9 codes with other structured data, such as lab results and medications, to build classification models that identify cohorts more accurately. The proposed method does not require prior information about the true class of the patients. We used the method to identify Diabetes Mellitus (DM) and Hyperlipidemia (HL) patient cohorts from a database of 800,000 patients. Evaluation results show that the method labeled as positive for DM 11,000 patients who did not have DM-related ICD-9 codes, and labeled as positive for HL 52,000 patients who did not have HL codes. A review of 400 patient charts (200 patients for each condition) by two clinicians shows that, for both conditions studied, the labeling assigned by the proposed approach is more consistent with that of the clinicians than is labeling through ICD-9 codes. The method is reasonably automated and, we believe, holds potential for inexpensive, more accurate cohort identification.
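The abstract does not describe the classifier or the bootstrapping steps in detail, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that ICD-9 code presence can serve as a noisy seed label, that a plain logistic regression is an acceptable stand-in classifier, and that two binary features (`hba1c_elevated`, `on_metformin`) represent the structured data. The patient records, feature names, function names, and the 0.5 threshold are all hypothetical.

```python
import math

# Hypothetical toy records (not from the study):
# (patient_id, has_dm_icd9_code, [hba1c_elevated, on_metformin])
PATIENTS = (
    [(f"coded-{i}", 1, [1, 1]) for i in range(6)]
    + [(f"uncoded-sick-{i}", 0, [1, 1]) for i in range(2)]
    + [(f"uncoded-well-{i}", 0, [0, 0]) for i in range(8)]
)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent (assumed stand-in model)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias


def flag_uncoded_positives(patients, threshold=0.5):
    """Train on ICD-9 presence as a noisy label, then flag uncoded patients
    whose structured data look like those of coded patients."""
    rows = [features for _, _, features in patients]
    noisy_labels = [coded for _, coded, _ in patients]
    weights, bias = fit_logistic(rows, noisy_labels)
    flagged = []
    for pid, coded, x in patients:
        score = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
        if not coded and score >= threshold:
            flagged.append(pid)
    return flagged


flagged = flag_uncoded_positives(PATIENTS)
print(flagged)
```

On this toy data, the only patients flagged are the two without a DM code whose lab and medication features match those of the coded patients. No true class labels are used at any point, which mirrors the property claimed in the abstract; a real application would need many more features and a validation step such as the chart review the study describes.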

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Cohort Studies*
  • Databases, Factual*
  • Diabetes Mellitus / classification
  • Diabetes Mellitus / diagnosis
  • Humans
  • Hyperlipidemias / classification
  • Hyperlipidemias / diagnosis
  • International Classification of Diseases / standards*