Objective: This study aims to evaluate whether large language models (LLMs) can accurately predict the urgency and severity of radiology reports.
Materials and methods: Based on the recommendations of the Academy of Royal Colleges, we defined radiology reports that include unexpected findings of high urgency or severity as "high-priority (HP) radiology reports." Overall, 1906 radiology reports were used as the training set, and 176 radiology reports were used as the test set, with a balanced ratio of HP to non-HP radiology reports (1:1) in both sets. Four types of LLMs (Llama2 7B, Llama3 8B, Llama3 Elyza 8B, and Llama 3.1 8B) were fine-tuned using four different input settings: (1) findings only, (2) findings + referring department, (3) findings + referring department + clinical diagnosis before examination, and (4) findings + referring department + clinical diagnosis before examination + details of examination request. The fine-tuned LLMs predicted whether each radiology report was HP or not.
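The four input settings are cumulative: each one adds a field to the previous setting. A minimal sketch of how such inputs could be assembled is shown below. The field names, the labels, the line-per-field layout, and the example report are illustrative assumptions; the abstract does not describe the actual prompt format used for fine-tuning.

```python
# Hypothetical field names, ordered as the abstract's settings add them.
FIELDS = [
    "findings",              # setting 1
    "referring_department",  # added in setting 2
    "clinical_diagnosis",    # added in setting 3 (diagnosis before examination)
    "request_details",       # added in setting 4 (details of examination request)
]

# Hypothetical human-readable labels for each field.
LABELS = {
    "findings": "Findings",
    "referring_department": "Referring department",
    "clinical_diagnosis": "Clinical diagnosis before examination",
    "request_details": "Details of examination request",
}


def build_input(report: dict, setting: int) -> str:
    """Concatenate the fields used by input setting 1, 2, 3, or 4."""
    if setting not in (1, 2, 3, 4):
        raise ValueError("setting must be 1, 2, 3, or 4")
    # Setting k uses the first k fields.
    return "\n".join(f"{LABELS[name]}: {report[name]}" for name in FIELDS[:setting])


# Made-up report, for illustration only (not taken from the study data).
example_report = {
    "findings": "Nodule in the right upper lobe.",
    "referring_department": "Orthopedics",
    "clinical_diagnosis": "Lumbar spinal stenosis",
    "request_details": "Preoperative chest imaging.",
}

text_setting_2 = build_input(example_report, 2)
print(text_setting_2)
```

With this layout, setting 2 (the best-performing configuration in the Results) yields a two-line input containing the findings and the referring department.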
Results: Among the four LLMs, Llama3 Elyza 8B, with inputs comprising findings and the referring department, demonstrated the best performance, achieving PRAUC = 0.962, ROCAUC = 0.968, accuracy = 0.915, sensitivity/recall = 0.932, specificity = 0.898, and F1 = 0.916. Adding the clinical diagnosis before examination and the details of the examination request to the input did not necessarily improve performance.
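The threshold-based metrics above can be cross-checked against the test-set size. The sketch below assumes that the 176 balanced test reports comprise 88 HP and 88 non-HP reports and uses the confusion-matrix counts that reproduce the reported values after rounding. These counts are inferred from the reported metrics, not stated in the abstract. PRAUC and ROCAUC depend on the predicted scores and cannot be recovered this way.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on HP reports
    specificity = tn / (tn + fp)   # recall on non-HP reports
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Inferred counts (assumption): 88 HP reports -> 82 detected, 6 missed;
# 88 non-HP reports -> 79 correctly rejected, 9 flagged as HP.
metrics = binary_metrics(tp=82, fn=6, tn=79, fp=9)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, accuracy, sensitivity, specificity, and F1 round to 0.915, 0.932, 0.898, and 0.916, matching the reported values.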
Conclusion: The fine-tuned LLMs accurately predicted HP radiology reports, suggesting their potential utility in supporting communication regarding radiology reports with high urgency or severity.
Key points: Question: Can large language models (LLMs) accurately predict high-priority (HP) radiology reports? Findings: The best fine-tuned LLM accurately predicted HP radiology reports, achieving a PRAUC of 0.962 and a ROCAUC of 0.968. Clinical relevance: This study demonstrates that fine-tuned LLMs can accurately identify HP radiology reports, potentially improving timely clinical decision-making and enhancing patient safety through faster communication of critical findings.
Keywords: Deep learning; Generative AI; Large language model; Radiology report; Safety.
© 2025. The Author(s), under exclusive licence to European Society of Radiology.