Advanced analysis of leading large language models for diagnostic accuracy in retinal imaging

Br J Ophthalmol. 2026 Jan 14:bjo-2025-327634. doi: 10.1136/bjo-2025-327634. Online ahead of print.

Abstract

Background/aims: To evaluate and compare the diagnostic capabilities of advanced large language models (LLMs) in interpreting ophthalmological fundus images across diverse pathologies.

Methods: We evaluated eight leading multimodal LLMs (GPT-4.5, Claude 3.7 Sonnet, Grok-2, Deepseek Cognition V2, Qwen2 72B, Gemini 2.0 Pro, Llama 3 405B and Mixtral 8×22B) on their ability to interpret 100 fundus images representing various ophthalmological conditions. Performance was assessed using validated charts for diagnostic accuracy, specificity, sensitivity, consistency, relevance and explanation quality.
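As an illustration only, the Python sketch below shows how sensitivity, specificity and accuracy of the kind named above are commonly computed from per-image labels. It is not the authors' scoring procedure. The one-vs-rest treatment of each condition, the use of `None` to represent a declined diagnosis, the choice to count a refusal as incorrect, and all function names and example labels are assumptions made for this sketch.

```python
from typing import Optional, Sequence


def overall_accuracy(truth: Sequence[str], predicted: Sequence[Optional[str]]) -> float:
    """Fraction of images whose predicted label exactly matches the reference label.

    Assumption: a refusal (None) never matches, so it counts as incorrect.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def refusal_rate(predicted: Sequence[Optional[str]]) -> float:
    """Fraction of images for which no diagnosis was given."""
    return sum(p is None for p in predicted) / len(predicted)


def one_vs_rest_metrics(
    truth: Sequence[str], predicted: Sequence[Optional[str]], condition: str
) -> dict:
    """Sensitivity and specificity for one condition against all others."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = fp = fn = tn = 0
    for t, p in zip(truth, predicted):
        actual = t == condition   # image truly shows the condition
        called = p == condition   # model named the condition
        if actual and called:
            tp += 1
        elif actual:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical labels for five images (not data from the study).
truth_labels = ["AMD", "RD", "AMD", "normal", "RD"]
model_labels = ["AMD", "RD", "RD", "normal", None]

print(overall_accuracy(truth_labels, model_labels))
print(refusal_rate(model_labels))
print(one_vs_rest_metrics(truth_labels, model_labels, "RD"))
```

Counting refusals as errors lowers overall accuracy for models that decline often; reporting the refusal rate separately, as the abstract does, keeps the two effects distinguishable.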

Results: GPT-4.5 achieved the highest overall diagnostic accuracy (65.0%), followed by Gemini 2.0 Pro (63.0%). The remaining models performed significantly worse: Deepseek Cognition V2 (52.0%), Claude 3.7 Sonnet (52.0%), Qwen2 72B (49.0%), Llama 3 405B (48.0%), Grok-2 (47.0%) and Mixtral 8×22B (46.0%). Performance varied across pathology categories for all models: rhegmatogenous pathologies were most accurately identified (Gemini 2.0 Pro: 81.3%, GPT-4.5: 75.0%), whereas myopic maculopathy was particularly challenging (mean accuracy 21.8%). Lower-performing models frequently declined to provide a diagnosis, with refusal rates ranging from 8.0% (Claude 3.7 Sonnet) to 19.0% (Mixtral 8×22B).
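The abstract does not report confidence intervals. Purely as an illustration of the sampling uncertainty that comes with a 100-image test set, the Python sketch below computes a Wilson score interval for a proportion. It assumes that each overall accuracy corresponds to a whole number of correct images out of 100 (for example, 65.0% as 65 of 100); the function name and the 95% level (z = 1.96) are choices made for this sketch, not details from the study.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumed counts derived from the reported overall accuracies on 100 images.
for name, correct in [("GPT-4.5", 65), ("Gemini 2.0 Pro", 63)]:
    low, high = wilson_interval(correct, 100)
    print(f"{name}: {low:.3f} to {high:.3f}")
```

Under these assumptions the two intervals overlap substantially, which fits the abstract's own description of the performance of the two leading models as competitive.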

Conclusion: Current LLMs show promising but limited capabilities in ophthalmological image interpretation. While performance on common conditions such as retinal detachment and age-related macular degeneration is moderately good, significant challenges remain with rare conditions, myopic pathologies and complex vascular disorders. GPT-4.5 and Gemini 2.0 Pro performed comparably overall, with each excelling in different pathology categories, which suggests that leveraging their complementary strengths might offer improved diagnostic support.

Keywords: Diagnostic tests/Investigation; Imaging; Retina; Telemedicine.