Automated Tumor International Classification of Diseases Coding of Real-World Pathology Reports Using Self-Hosted Large Language Models

JCO Clin Cancer Inform. 2026 Mar;10:e2500254. doi: 10.1200/CCI-25-00254. Epub 2026 Mar 11.

Abstract

Purpose: Manual coding of pathology reports with International Classification of Diseases for Oncology, Third Edition (ICD-O-3) codes is time-consuming, error-prone, and resource-intensive for health care institutions. This study evaluates the performance of multiple state-of-the-art open-source large language models (LLMs) in extracting ICD-O-3 topography and morphology codes from real-world pathology reports across multiple evaluation setups and assesses their potential for clinical implementation.

Methods: We analyzed 21,364 pathology reports from 10,823 patients documented between 2013 and 2025 at a large German hospital. Five LLMs were evaluated: Llama-3.3-70B-Instruct, DeepSeek-R1-Distill-Llama (8B and 70B variants), Qwen3-235B-A22B, and Gemma-3-12B-it. All models were deployed on secured, private hospital information technology infrastructure. Three prompts were developed: two for topography extraction (with and without anatomic context) and one for morphology extraction. Performance was evaluated using exact code matches and matches on the first three code positions.
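
The following minimal Python sketch illustrates the two evaluation criteria named above (exact code match and match on the first three code positions) together with micro- and macro-averaged F1. The example codes and the pairing of gold and predicted labels are hypothetical and do not reproduce the study's data or code.

```python
# Hypothetical gold and predicted ICD-O-3 topography codes; values are
# illustrative only, not data from the study.
gold = ["C50.9", "C34.1", "C18.7", "C50.4", "C61.9"]
pred = ["C50.9", "C34.3", "C18.7", "C50.9", "C61.9"]

def f1_scores(y_true, y_pred, key=lambda c: c):
    """Micro- and macro-average F1 over single-label predictions.

    `key` maps a code to the granularity being scored, e.g. the full code
    for exact matching or its first three characters for partial matching.
    """
    y_true = [key(c) for c in y_true]
    y_pred = [key(c) for c in y_pred]
    labels = sorted(set(y_true) | set(y_pred))
    per_label_f1 = []
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
        denom = 2 * tp + fp + fn
        per_label_f1.append(2 * tp / denom if denom else 0.0)
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    macro = sum(per_label_f1) / len(per_label_f1)
    return micro, macro

# Exact code match versus match on the first three positions (e.g. "C50").
print("exact  micro/macro F1:", f1_scores(gold, pred))
print("3-char micro/macro F1:", f1_scores(gold, pred, key=lambda c: c[:3]))
```

Scoring on only the first three characters credits predictions that capture the broader code group while missing finer detail, consistent with the higher three-character scores reported in the Results.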

Results: For exact ICD-O topography code prediction, Qwen3-235B-A22B achieved the highest performance (micro-average F1: 71.6%), whereas Llama-3.3-70B-Instruct performed best at predicting the first three characters (micro-average F1: 84.6%). For morphology codes, DeepSeek-R1-Distill-Llama-70B outperformed the other models (exact-match micro-average F1: 34.7%; first-three-character micro-average F1: 77.8%). Large disparities between micro- and macro-average F1 scores indicated poor generalization to rare conditions.
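
For reference, the standard single-label definitions of the two averages (not reproduced from the paper) make the disparity interpretable: micro-averaging pools counts over all reports, so frequent tumor entities dominate, whereas macro-averaging weights every code equally, so rare codes pull the score down.

```latex
% Standard micro- and macro-average F1 over classes c = 1..C with per-class
% true positives TP_c, false positives FP_c, and false negatives FN_c.
\mathrm{F1}_{\mathrm{micro}}
  = \frac{2\sum_{c} \mathrm{TP}_c}
         {2\sum_{c} \mathrm{TP}_c + \sum_{c} \mathrm{FP}_c + \sum_{c} \mathrm{FN}_c},
\qquad
\mathrm{F1}_{\mathrm{macro}}
  = \frac{1}{C}\sum_{c=1}^{C}
    \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}.
```

Under these definitions, a model that codes common entities correctly but fails on rare ones can score high on the micro-average while the macro-average stays low, which is the pattern the authors attribute to poor generalization to rare conditions.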

Conclusion: Although LLMs demonstrate promising capabilities as support systems for expert-guided pathology coding, their performance is not yet sufficient for fully automated, unsupervised use in routine clinical workflows. Key limitations were poor performance on rare conditions, heavy dependence on contextual information, and substantially lower scores for morphology than for topography classification.

MeSH terms

  • Clinical Coding* / methods
  • Humans
  • International Classification of Diseases*
  • Large Language Models
  • Neoplasms* / classification
  • Neoplasms* / diagnosis
  • Neoplasms* / pathology