Assessing the quality of AI-generated and physician-written discharge summaries: evaluation of an EHR-integrated tool in a Dutch academic hospital

EBioMedicine. 2026 May:127:106247. doi: 10.1016/j.ebiom.2026.106247. Epub 2026 Apr 9.

Abstract

Background: Large language models (LLMs) offer potential to reduce administrative burden in clinical care by generating discharge summaries. Most prior evaluations have been limited to drafts, small cohorts, or non-integrated settings. Robust validation of fully automated, EHR-integrated systems in real-world practice is lacking.

Methods: This study was conducted in April 2025 at a Dutch academic hospital. A total of 292 paired discharge summaries from multiple departments were evaluated, each pair consisting of a physician-written and an LLM-generated version. Summaries were independently assessed by eight blinded clinicians on a 5-point Likert scale for completeness, correctness, and conciseness. Trustworthiness was also scored. Domain and total scores were compared with Wilcoxon signed-rank tests, and interrater reliability was quantified using Gwet's AC2.
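
As an illustration of the paired comparison named in the Methods, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test in plain Python. It is not the authors' analysis code: the function name `wilcoxon_signed_rank` and the example scores are hypothetical, zero differences are simply dropped, and the p-value uses a normal approximation with tie correction and no continuity correction, which may differ slightly from a statistics package.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Illustrative sketch only: zero differences are discarded and tied
    absolute differences receive the average of their ranks.
    Returns (w_plus, w_minus, z, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Sort indices by absolute difference, then assign average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Mean and tie-corrected variance of W+ under the null hypothesis.
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, w_minus, z, p_value


# Hypothetical paired Likert scores, NOT data from the study.
llm_scores = [5, 4, 5, 3, 4]
physician_scores = [4, 4, 3, 4, 2]
w_plus, w_minus, z, p = wilcoxon_signed_rank(llm_scores, physician_scores)
```

With only a handful of pairs the normal approximation is rough; an exact test would be preferable for small samples, whereas for 292 pairs the approximation is the usual choice.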

Findings: LLM-generated summaries had lower completeness (4.50 (4.00-5.00) vs 5.00 (4.50-5.00); p < 0.001), similar correctness (5.00 (4.50-5.00) vs 5.00 (4.63-5.00); p = 0.14), and greater conciseness (5.00 (4.50-5.00) vs 4.50 (4.00-5.00); p < 0.001) compared with physician-written summaries. Total scores did not differ (14.00 (13.00-15.00) vs 14.00 (13.00-15.00); p = 0.34). Physician-written summaries were trusted by both reviewers in 279 (95.5%) cases, whereas LLM-generated summaries were trusted in 249 (85.3%) cases, partially trusted in 34 (11.6%), and rejected in 9 (3.1%). Interrater agreement for total scores was high (AC2 0.87, 95% CI 0.83-0.90 for LLM; 0.85, 95% CI 0.81-0.89 for physician).
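
The trust percentages above follow directly from the reported counts and the 292 pairs. A short Python check, offered only as a sketch (the names `TOTAL_PAIRS`, `llm_trust_counts`, and `pct` are illustrative, not from the study):

```python
TOTAL_PAIRS = 292  # paired discharge summaries, as reported in the Methods

# Counts reported in the Findings for LLM-generated summaries.
llm_trust_counts = {"trusted": 249, "partially trusted": 34, "rejected": 9}
physician_trusted = 279  # physician-written summaries trusted by both reviewers


def pct(count, total=TOTAL_PAIRS):
    """Percentage of the total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# The three LLM categories should account for every pair.
assert sum(llm_trust_counts.values()) == TOTAL_PAIRS

llm_trust_pct = {label: pct(n) for label, n in llm_trust_counts.items()}
physician_trust_pct = pct(physician_trusted)
```

The three LLM categories sum to 292, and the rounded percentages match the reported 85.3%, 11.6%, 3.1%, and 95.5%.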

Interpretation: Discharge summaries generated by an EHR-integrated LLM achieved quality ratings comparable to physician-written documents across multiple specialties, with no difference in total scores. Unlike earlier pilot work, this study demonstrates real-world feasibility of automated LLM use in clinical workflows at scale. With appropriate oversight and specialty-specific refinement, such systems could substantially reduce documentation burden while maintaining discharge summary quality.

Funding: This research did not receive a specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Keywords: Clinical documentation; Discharge summaries; Electronic health records; Health informatics; Large language models; Validation study.

MeSH terms

  • Academic Medical Centers
  • Electronic Health Records* / standards
  • Humans
  • Netherlands
  • Patient Discharge Summaries* / standards
  • Patient Discharge*
  • Physicians*
  • Reproducibility of Results