Objective: Generative artificial intelligence is rapidly evolving and is now being explored in health care to support patient and clinician education. This study evaluated the accuracy, completeness, and readability of four large language models (LLMs), ChatGPT 3.5, Gemini, ChatGPT 4.0, and OpenEvidence, in answering questions about menopause and hormone therapy.
Methods: A total of 35 questions (20 patient-level, 15 clinician-level) were entered into each LLM; OpenEvidence was used only for the clinician-level questions. Four blinded expert reviewers rated each response as accurate and complete, accurate but incomplete, or inaccurate. Readability of patient-level responses was assessed using the Flesch Reading Ease Score (FRES) and word count. Readability scores were compared with ANOVA, and accuracy was compared using odds ratios.
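The statistical machinery here is standard. As a rough illustration only (not the authors' analysis code), FRES can be computed with the third-party textstat package, and an odds ratio with a Wald 95% confidence interval can be derived directly from 2x2 accuracy counts; the counts below are hypothetical:

```python
import math
import textstat  # third-party readability package

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table:
    a/b = accurate/not-accurate counts for model 1,
    c/d = accurate/not-accurate counts for model 2."""
    or_est = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_est) - z * se_log)
    upper = math.exp(math.log(or_est) + z * se_log)
    return or_est, lower, upper

response = "Hormone therapy can relieve hot flashes for many women."
print(textstat.flesch_reading_ease(response))  # higher score = easier to read
print(odds_ratio_wald(12, 8, 15, 5))           # hypothetical counts, OR = 0.5
```

The study may instead have used logistic regression or exact methods for the confidence intervals; the sketch above shows only the simplest closed-form approach.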
Results: For patient-level questions, ChatGPT 3.5 achieved the highest accuracy (70%), followed by ChatGPT 4.0 (60%) and Gemini (30%); Gemini had significantly lower odds of accuracy compared with ChatGPT 3.5 (OR=0.18, 95% CI=0.05-0.71; P=0.014). FRES scores differed significantly (P<0.001): Gemini scored 38.9±7.3 ("difficult"), ChatGPT 3.5 scored 31.0±11.2, and ChatGPT 4.0 scored 26.5±8.6 (both "very difficult"). For clinician-level questions, ChatGPT 4.0 achieved the highest accuracy (67%), followed by ChatGPT 3.5 and OpenEvidence (60% each) and Gemini (47%); no significant differences were observed among models (all P>0.05).
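These figures are internally consistent. Assuming accuracy was scored per question across the 20 patient-level items, 70% accuracy for ChatGPT 3.5 corresponds to 14/20 and 30% for Gemini to 6/20, and the implied odds ratio reproduces the reported estimate:

```python
# Quick consistency check on the reported OR (assumes 20 patient-level
# questions per model, each scored accurate vs. not accurate).
gemini_odds = 6 / 14        # Gemini: 6 accurate, 14 not (30%)
chatgpt35_odds = 14 / 6     # ChatGPT 3.5: 14 accurate, 6 not (70%)
print(gemini_odds / chatgpt35_odds)  # ~0.184, consistent with OR = 0.18
```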
Conclusion: The LLMs demonstrated limited accuracy, with frequent incorrect or incomplete responses to menopause-related queries, highlighting the need to improve model performance before these tools can provide reliable information for both patients and clinicians.
Keywords: Artificial intelligence; Clinician education; Large language models; Menopause education; Patient education.