Background/objectives: This paper aims to assess ChatGPT's performance in answering European Board of Ophthalmology Diploma (EBOD) examination papers and to compare its performance against pass benchmarks and human candidate results.
Methods: This cross-sectional study used a sample of past exam papers from the 2012, 2013, and 2020-2023 EBOD examinations. It analyzed ChatGPT's responses to 440 multiple choice questions (MCQs), each containing five true/false statements (2200 statements in total), and to 48 single best answer (SBA) questions.
Results: For MCQs, ChatGPT scored 64.39% on average. Its strongest MCQ metric was precision (68.76%). ChatGPT performed best at answering pathology MCQs (Grubbs test, p < 0.05), while optics and refraction was the lowest-scoring MCQ topic across all metrics. ChatGPT-3.5 Turbo performed worse than human candidates and ChatGPT-4o on easy questions (75% vs. 100% accuracy) but outperformed both on challenging questions (50% vs. 28% accuracy). ChatGPT's SBA performance averaged 28.43%, with precision again the strongest metric (29.36%). Pathology SBA questions were consistently the lowest-scoring topic across most metrics. ChatGPT showed a nonsignificant tendency to select option 1 more frequently (p = 0.19). When answering SBAs, human candidates scored higher than ChatGPT across all metrics measured.
Conclusions: ChatGPT performed more strongly on true/false (MCQ) questions, achieving a pass mark in most instances. Its performance was poorer on SBA questions, suggesting that ChatGPT is better at information retrieval than at knowledge integration. ChatGPT could become a valuable tool in ophthalmic education, allowing exam boards to check that exam papers are pitched at the right level, to mark open-ended questions, and to provide detailed feedback.
Keywords: artificial intelligence; medical examinations; ophthalmology.