Purpose: This study aimed to evaluate and compare the diagnostic reasoning of five large language models (LLMs) in complex neuro-ophthalmological cases. We assessed the performance of GPT-o1 Pro, GPT-4o, Google Gemini, Grok 2 and DeepSeek in handling clinical scenarios in neuro-ophthalmology.
Methods: Eighteen clinical scenarios, derived from six complex neuro-ophthalmological cases, were presented to five LLMs: GPT-o1 Pro, GPT-4o, Google Gemini, Grok 2 and DeepSeek. The responses generated by these models were evaluated using the Revised-IDEA (R-IDEA) assessment tool; scores of 6-10 denoted high-quality responses, and scores of 8-10 denoted 'Excellent' responses. In addition, the conciseness of each response was assessed by word count using a readability tool.
Results: GPT-o1 Pro achieved a significantly higher mean R-IDEA score (8.80) than GPT-4o (6.80) and Grok 2 (6.94) (p=0.001). It produced 100% high-quality responses, compared with 72.2% for GPT-4o, 77.8% for Grok 2 and 83.3% for both Gemini and DeepSeek (p=0.175). GPT-o1 Pro also had the highest proportion of responses rated 'Excellent' (88.9%), significantly outperforming the other models: 27.8% for GPT-4o, 38.9% for Grok 2 and 55.6% for both Gemini and DeepSeek (p=0.003). GPT-o1 Pro used the fewest words, with significant differences compared with GPT-4o (p<0.001) and Gemini (p=0.032).
Conclusion: This study underscores the superior clinical reasoning of GPT-o1 Pro in neuro-ophthalmology compared with the other LLMs evaluated, highlighting its potential to enhance diagnostic processes in this complex field.
Keywords: Diagnostic tests/Investigation; Optic Nerve.