Evaluating Medical Text Summaries Using Automatic Evaluation Metrics and LLM-as-a-Judge Approach: A Pilot Study

Diagnostics (Basel). 2025 Dec 19;16(1):3. doi: 10.3390/diagnostics16010003.

Abstract

Background: Electronic health records (EHRs) remain a vital source of clinical information, yet processing these heterogeneous data is extremely labor-intensive. Summarization of these data with Large Language Models (LLMs) is a promising way to support practicing physicians. Unbiased, automated quality control is crucial for integrating such tools into routine practice while saving time and labor. This pilot study aimed to assess the potential and constraints of self-contained evaluation of summarization quality (without expert involvement) based on automatic evaluation metrics and LLM-as-a-judge.

Methods: Summaries of text data from 30 EHRs were generated by six open-source low-parameter LLMs. The medical summaries were evaluated with standard automatic metrics (BLEU, ROUGE, METEOR, BERTScore) and with the LLM-as-a-judge approach using the following criteria: relevance, completeness, redundancy, coherence and structure, grammar and terminology, and hallucinations. Expert evaluation was conducted using the same criteria.

Results: The results showed that LLMs hold great promise for summarizing medical data. Nevertheless, neither the automatic metrics nor the LLM judges reliably detect factual errors and semantic distortions (hallucinations). For relevance, the Pearson correlation between the LLM-judge quality scores and the expert ratings was 0.688.

Conclusions: Fully automating the evaluation of medical summaries remains challenging. Further research should focus on dedicated methods for detecting hallucinations and on larger or specialized models trained on medical texts. The potential integration of retrieval-augmented generation (RAG) into the LLM-as-a-judge architecture also deserves attention. Even now, however, the combination of LLMs and automatic evaluation metrics can underpin medical decision support systems by performing initial evaluations and flagging potential shortcomings for expert review.
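The agreement statistic reported above (Pearson correlation of 0.688 for relevance) can be computed as in the following minimal sketch. The scores shown are made-up placeholders, not data from the study; a per-summary relevance score from an LLM judge is compared against an expert rating on the same scale.

```python
# Hypothetical illustration: Pearson correlation between LLM-judge
# relevance scores and expert relevance ratings for a set of summaries.
# All scores below are invented placeholders, not the study's data.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Example: judge vs. expert scores on a 1-5 relevance scale (placeholder data)
judge_scores = [4, 3, 5, 2, 4, 3]
expert_scores = [4, 2, 5, 3, 4, 4]
r = pearson(judge_scores, expert_scores)
```

In practice the same computation is available as `scipy.stats.pearsonr`, which also returns a p-value for the correlation.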

Keywords: LLM-as-a-judge; electronic health records; large language model; summaries.