Development and validation of a predictive model for detection of colorectal cancer in primary care by analysis of complete blood counts: a binational retrospective study

J Am Med Inform Assoc. 2016 Sep;23(5):879-90. doi: 10.1093/jamia/ocv195. Epub 2016 Feb 15.


Objective: The use of risk prediction models grows as electronic medical records become widely available. Here, we develop and validate a model to identify individuals at increased risk for colorectal cancer (CRC) by analyzing blood counts, age, and sex, then determine the model's value when used to supplement conventional screening.

Materials and methods: Primary care data were collected from a cohort of 606 403 Israelis (of whom 3135 were diagnosed with CRC) and a case-control UK dataset of 5061 CRC cases and 25 613 controls. The model was developed on 80% of the Israeli dataset and validated on the remaining 20% of the Israeli dataset and on the UK dataset. Performance was evaluated according to the area under the curve, specificity, and odds ratio at several working points.
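The three evaluation measures named above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the synthetic risk scores, and the choice of the Mann-Whitney estimate for the area under the curve are all assumptions made for illustration only.

```python
import math
import random


def auc(case_scores, control_scores):
    """Mann-Whitney estimate of the AUC: P(case score > control score), ties count 1/2."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def specificity_at_sensitivity(case_scores, control_scores, sensitivity=0.5):
    """Specificity at the highest threshold that flags at least `sensitivity` of cases."""
    ranked = sorted(case_scores, reverse=True)
    n_needed = max(1, math.ceil(sensitivity * len(ranked)))
    threshold = ranked[n_needed - 1]  # scores >= threshold are flagged
    true_neg = sum(1 for k in control_scores if k < threshold)
    return true_neg / len(control_scores)


def odds_ratio_at_fpr(case_scores, control_scores, fpr=0.005):
    """Odds ratio (TP*TN)/(FP*FN) at the threshold flagging roughly `fpr` of controls."""
    ranked = sorted(control_scores, reverse=True)
    n_fp = max(1, int(fpr * len(ranked)))
    threshold = ranked[n_fp - 1]
    tp = sum(1 for c in case_scores if c >= threshold)
    fn = len(case_scores) - tp
    fp = sum(1 for k in control_scores if k >= threshold)
    tn = len(control_scores) - fp
    if fp == 0 or fn == 0:
        return math.inf
    return (tp * tn) / (fp * fn)


# Synthetic scores (assumption): cases tend to score higher than controls.
rng = random.Random(0)
controls = [rng.gauss(0.0, 1.0) for _ in range(2000)]
cases = [rng.gauss(1.3, 1.0) for _ in range(200)]

print("AUC:", round(auc(cases, controls), 3))
print("Specificity at 50% sensitivity:",
      round(specificity_at_sensitivity(cases, controls, 0.5), 3))
print("Odds ratio at 0.5% FPR:",
      round(odds_ratio_at_fpr(cases, controls, 0.005), 1))
```

Each working point fixes one quantity (sensitivity or false-positive rate) and reports the other, which is why a single model yields several specificity and odds-ratio figures.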

Results: Using blood counts obtained 3-6 months before diagnosis, the area under the curve for detecting CRC was 0.82 ± 0.01 for the Israeli validation set. At a sensitivity of 50% (that is, detecting 50% of CRC cases), the specificity was 88 ± 2% in the Israeli validation set and 94 ± 1% in the UK dataset. At a false-positive rate of 0.5%, the odds ratio was 26 ± 5 and 40 ± 6, respectively. Specificity at 50% sensitivity was 87 ± 2% a year before diagnosis and 85 ± 2% for localized cancers. When used in addition to the fecal occult blood test, our model enabled more than a 2-fold increase in CRC detection.
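For orientation, the odds ratio at a working point can be related to sensitivity and false-positive rate. The following assumes the standard diagnostic odds ratio computed from a 2×2 table of flagged versus unflagged individuals; the abstract does not state the exact definition used, and the symbols Se and FPR are introduced here for illustration.

```latex
\mathrm{OR}
  = \frac{TP \cdot TN}{FP \cdot FN}
  = \frac{\mathrm{Se} / (1 - \mathrm{Se})}{\mathrm{FPR} / (1 - \mathrm{FPR})},
\qquad
\mathrm{Se} = \frac{TP}{TP + FN},
\quad
\mathrm{FPR} = \frac{FP}{FP + TN} = 1 - \text{specificity}.
```

Under this definition, the odds ratio at a fixed false-positive rate depends only on the sensitivity reached at that threshold.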

Discussion: Comparable results in 2 unrelated populations suggest that the model should generally apply to the detection of CRC in other groups. The model's performance is superior to that of current iron deficiency anemia management guidelines and may help physicians to identify individuals requiring additional clinical evaluation.

Conclusions: Our model may help to detect CRC earlier in clinical practice.

Keywords: colorectal cancer; early detection of cancer; electronic medical records; machine learning; primary health care; risk prediction.

Publication types

  • Validation Study

MeSH terms

  • Adult
  • Anemia, Iron-Deficiency / diagnosis
  • Area Under Curve
  • Blood Cell Count*
  • Colorectal Neoplasms / blood
  • Colorectal Neoplasms / diagnosis*
  • Decision Trees
  • Early Detection of Cancer / methods*
  • Female
  • Humans
  • Machine Learning
  • Male
  • Middle Aged
  • Occult Blood*
  • Primary Health Care
  • Retrospective Studies
  • Risk Assessment
  • Sensitivity and Specificity