Comparative analysis of large language models and clinician responses in patient blood management knowledge

Minerva Anestesiol. 2025 Oct;91(10):909-918. doi: 10.23736/S0375-9393.25.19014-7. Epub 2025 Aug 5.

Abstract

Background: Large language models (LLMs) are increasingly used in the medical field and have the potential to reduce workload and improve treatment procedures in clinical practice. This study evaluates the capabilities of LLMs to answer common questions related to patient blood management (PBM) and compares their performance to the expertise of clinicians from two university hospitals.

Methods: We evaluated ChatGPT-3.5, ChatGPT-4o, and Google Gemini on a representative sample of 40 PBM-related questions (30 single-choice and 10 frequently asked patient questions) and compared the models' responses with those of clinicians. The accuracy and interrater reliability of the answers were analyzed.

Results: For PBM knowledge-based questions, the proportion of correct answers was 96.4% (95% CI: 93.6-98.0%) for ChatGPT-4o, 81.3% (95% CI: 77.0-85.7%) for ChatGPT-3.5, and 84.0% (95% CI: 79.4-87.7%) for Google Gemini. Clinicians (N=82) provided correct answers to 76.5% (95% CI: 74.7-78.1%) of the questions. For frequently asked patient questions, the proportion of correct answers was 100% for ChatGPT-4o, 95.5% (95% CI: 91.4-99.6%) for ChatGPT-3.5, and 91.7% (95% CI: 86.0-97.4%) for Google Gemini. Clinicians provided correct answers to 62.0% (95% CI: 58.7-65.3%) of the questions. In every category (anemia management, iron supplementation, cell salvage, principles of PBM, and blood transfusion), ChatGPT-4o provided the most correct answers.
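As an illustration of how a proportion of correct answers and its 95% confidence interval can be computed, the following is a minimal Python sketch. The abstract does not state which interval method the authors used or how many responses underlie each percentage, so the Wilson score interval, the function name `wilson_interval`, and the counts in the usage lines are assumptions chosen for illustration only, not the study's data or method.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the interval method is illustrative; the abstract does not
    say which method was used. z = 1.96 corresponds to a 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials                      # observed proportion correct
    denom = 1 + z**2 / trials                   # shared denominator
    center = (p + z**2 / (2 * trials)) / denom  # adjusted midpoint
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Hypothetical counts, NOT taken from the study: 27 correct out of 30 answers.
low, high = wilson_interval(27, 30)
print(f"proportion = {27 / 30:.1%}, 95% CI: {low:.1%}-{high:.1%}")
```

Unlike the simple normal approximation, the Wilson interval stays within 0% to 100% and can be asymmetric around the observed proportion, which matters for proportions close to 100%.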

Conclusions: LLMs show strong potential for delivering accurate and comprehensive responses to common PBM-related questions. However, it remains essential for clinicians and patients to verify responses, particularly in critical situations.

Publication types

  • Comparative Study

MeSH terms

  • Blood Transfusion*
  • Humans
  • Language*
  • Large Language Models
  • Surveys and Questionnaires