Predicting Early-Onset Colorectal Cancer with Large Language Models

AMIA Annu Symp Proc. 2025 May 22;2024:653-662. eCollection 2024.

Abstract

The incidence rate of early-onset colorectal cancer (EoCRC, age < 45) has increased every year, yet these patients are younger than the age at which national guidelines recommend starting cancer screening. In this paper, we applied 10 different machine learning models to predict EoCRC and compared their performance with that of advanced large language models (LLMs), using patient conditions, lab results, and observations recorded in the 6 months before the CRC diagnosis. We retrospectively identified 1,953 CRC patients from multiple health systems across the United States. The fine-tuned LLM achieved an average sensitivity of 73% and an average specificity of 91%.
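For readers less familiar with the two reported metrics, the sketch below shows how sensitivity and specificity are conventionally computed from binary predictions. It is an illustration only, not the authors' evaluation code: the function name `sensitivity_specificity` and the toy labels are assumptions introduced here, and the numbers are not study data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Convention assumed here: 1 = EoCRC case, 0 = non-case.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-cases correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-cases wrongly flagged

    # Sensitivity: fraction of true cases that the model identifies.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of true non-cases that the model clears.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy labels for illustration only (not taken from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With these toy labels, three of four cases are flagged and five of six non-cases are cleared. In a screening context, sensitivity reflects how many EoCRC patients a model would catch, while specificity reflects how many unaffected patients it would avoid flagging.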

MeSH terms

  • Age of Onset
  • Colorectal Neoplasms* / diagnosis
  • Colorectal Neoplasms* / epidemiology
  • Humans
  • Large Language Models
  • Machine Learning*
  • Middle Aged
  • Prediction Algorithms
  • Predictive Learning Models
  • Retrospective Studies
  • Sensitivity and Specificity
  • United States / epidemiology