Evaluating the diagnostic reasoning of large language models in complex neuro-ophthalmological cases: a comparative analysis of GPT-o1 Pro, GPT-4o, Gemini, Grok 2 and DeepSeek

BMJ Open Ophthalmol. 2025 Dec 4;10(1):e002185. doi: 10.1136/bmjophth-2025-002185.

Abstract

Purpose: This study evaluates and compares the diagnostic reasoning of five large language models (LLMs), GPT-o1 Pro, GPT-4o, Google Gemini, Grok 2 and DeepSeek, in complex neuro-ophthalmological cases.

Method: Eighteen clinical scenarios, derived from six complex neuro-ophthalmological cases, were presented to the five LLMs. The responses were evaluated using the Revised-IDEA (R-IDEA) assessment tool. Responses with R-IDEA scores from 6 to 10 were classed as high quality, and those scoring from 8 to 10 as 'Excellent'. In addition, the simplicity of each response was evaluated based on word count using a readability tool.
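The score bands described in the Method can be illustrated with a minimal Python sketch. It is not the authors' scoring procedure: the function names, the 0 to 10 bounds check and the label for scores below 6 are assumptions made for illustration, and the whitespace-based word count is only a stand-in for the unnamed readability tool.

```python
def rate_response(score: float) -> str:
    """Map an R-IDEA score to the bands given in the abstract."""
    # Assumption: scores lie between 0 and 10; the abstract only states the upper bands.
    if not 0 <= score <= 10:
        raise ValueError("score is assumed to lie between 0 and 10")
    if score >= 8:
        return "excellent"  # 8 to 10; these responses are also high quality
    if score >= 6:
        return "high-quality"  # 6 to 10 overall
    return "below high-quality"  # hypothetical label; the abstract does not name this band


def is_high_quality(score: float) -> bool:
    """True for any score in the 6 to 10 band, 'Excellent' responses included."""
    return 6 <= score <= 10


def word_count(text: str) -> int:
    """Count whitespace-separated words (a stand-in for the study's readability tool)."""
    return len(text.split())
```

Under these assumptions, a response scored 9 is both 'Excellent' and high quality, while a response scored 7 is high quality only.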

Result: GPT-o1 Pro (8.80) significantly outperformed GPT-4o (6.80) and Grok 2 (6.94) in the R-IDEA scores (p=0.001). It achieved 100% high-quality responses, compared with 72.2% for GPT-4o, 77.8% for Grok 2 and 83.3% for both Gemini and DeepSeek (p=0.175). For 'Excellent' responses, 88.9% of GPT-o1 Pro's responses reached this rating, a significantly higher proportion than for the other models: 27.8% for GPT-4o, 38.9% for Grok 2 and 55.6% for both Gemini and DeepSeek (p=0.003). GPT-o1 Pro also used the fewest words, significantly fewer than GPT-4o (p<0.001) and Gemini (p=0.032).
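The reported percentages can be converted back to response counts with simple arithmetic. The sketch below assumes each model gave exactly one response per scenario (18 responses per model), which the abstract implies but does not state; the variable and function names are illustrative.

```python
N_RESPONSES = 18  # assumption: one response per scenario for each model

# Percentages as reported in the abstract.
high_quality_pct = {
    "GPT-o1 Pro": 100.0, "GPT-4o": 72.2, "Grok 2": 77.8,
    "Gemini": 83.3, "DeepSeek": 83.3,
}
excellent_pct = {
    "GPT-o1 Pro": 88.9, "GPT-4o": 27.8, "Grok 2": 38.9,
    "Gemini": 55.6, "DeepSeek": 55.6,
}


def implied_counts(pct_by_model: dict, n: int = N_RESPONSES) -> dict:
    """Convert each percentage to the nearest whole number of responses out of n."""
    return {model: round(pct / 100 * n) for model, pct in pct_by_model.items()}


def round_trips(pct_by_model: dict, n: int = N_RESPONSES) -> bool:
    """Check that each implied count reproduces the reported percentage to one decimal."""
    counts = implied_counts(pct_by_model, n)
    return all(round(counts[m] / n * 100, 1) == pct for m, pct in pct_by_model.items())


if __name__ == "__main__":
    for label, table in (("High-quality", high_quality_pct), ("Excellent", excellent_pct)):
        print(label, implied_counts(table), "consistent:", round_trips(table))
```

With 18 responses per model, the high-quality percentages correspond to 18, 13, 14, 15 and 15 responses, and the 'Excellent' percentages to 16, 5, 7, 10 and 10 responses, for GPT-o1 Pro, GPT-4o, Grok 2, Gemini and DeepSeek respectively. Every reported percentage is reproduced exactly at one decimal place, which is consistent with the assumed denominator.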

Conclusion: The study underscores the superior clinical reasoning capabilities of GPT-o1 Pro in neuro-ophthalmology compared with other LLMs, highlighting its potential for enhancing diagnostic processes in this complex field.

Keywords: Diagnostic tests/Investigation; Optic Nerve.

Publication types

  • Comparative Study

MeSH terms

  • Clinical Reasoning*
  • Female
  • Humans
  • Language*
  • Large Language Models
  • Male
  • Ophthalmology*