Background/Aims
To evaluate and compare the diagnostic capabilities of advanced large language models (LLMs) in interpreting ophthalmological fundus images across diverse pathologies.

Methods
We evaluated eight leading multimodal LLMs (GPT-4.5, Claude 3.7 Sonnet, Grok-2, Deepseek Cognition V2, Qwen2 72B, Gemini 2.0 Pro, Llama 3 405B and Mixtral 8×22B) on their ability to interpret 100 fundus images representing various ophthalmological conditions. Performance was assessed using validated charts for diagnostic accuracy, specificity, sensitivity, consistency, relevance and explanation quality.

Results
GPT-4.5 achieved the highest overall diagnostic accuracy (65.0%), followed by Gemini 2.0 Pro (63.0%). All models showed varied performance across pathology categories, with rhegmatogenous pathologies being most accurately identified (Gemini 2.0 Pro: 81.3%, GPT-4.5: 75.0%) and myopic maculopathy (mean accuracy 21.8%) being particularly challenging. The remaining models performed significantly worse: Deepseek Cognition V2 (52.0%), Claude 3.7 Sonnet (52.0%), Qwen2 72B (49.0%), Llama 3 405B (48.0%), Grok-2 (47.0%) and Mixtral 8×22B (46.0%). Lower-performing models frequently declined to provide diagnoses, with refusal rates from 8.0% (Claude 3.7 Sonnet) to 19.0% (Mixtral 8×22B).

Conclusion
Current LLMs show promising but limited capabilities in ophthalmological image interpretation. While performance on common conditions like retinal detachments and age-related macular degeneration is moderately good, significant challenges remain with rare conditions, myopic pathologies and complex vascular disorders. The competitive performance between GPT-4.5 and Gemini 2.0 Pro, with each excelling in different pathology categories, suggests that leveraging their complementary strengths might offer improved diagnostic support.
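
To make the overall accuracy figures easier to compare, the sketch below tabulates the percentages reported in the abstract and converts them to counts of correct diagnoses. It assumes that each of the 100 images received a single correct/incorrect verdict per model, so that accuracy equals correct diagnoses divided by 100; the abstract does not state the scoring procedure, and the names `reported_accuracy`, `correct_count` and `rank_models` are illustrative only.

```python
# Overall diagnostic accuracy (%) as reported in the abstract, on 100 fundus images.
N_IMAGES = 100

reported_accuracy = {
    "GPT-4.5": 65.0,
    "Gemini 2.0 Pro": 63.0,
    "Deepseek Cognition V2": 52.0,
    "Claude 3.7 Sonnet": 52.0,
    "Qwen2 72B": 49.0,
    "Llama 3 405B": 48.0,
    "Grok-2": 47.0,
    "Mixtral 8×22B": 46.0,
}


def correct_count(accuracy_pct: float, n_images: int = N_IMAGES) -> int:
    """Assumed conversion: accuracy = correct / n_images, one verdict per image."""
    return round(accuracy_pct / 100 * n_images)


def rank_models(accuracy: dict[str, float]) -> list[tuple[str, int]]:
    """Return (model, correct diagnoses) pairs, best first; ties keep input order."""
    pairs = [(model, correct_count(pct)) for model, pct in accuracy.items()]
    return sorted(pairs, key=lambda pair: -pair[1])


ranking = rank_models(reported_accuracy)
for model, n_correct in ranking:
    print(f"{model:<22} {n_correct:>3}/{N_IMAGES}")

# Difference in correct diagnoses between the two best models.
gap = ranking[0][1] - ranking[1][1]
print(f"Gap between top two models: {gap} images")
```

Under this assumption, the two leading models differ by two images out of 100, which is consistent with the abstract describing their performance as competitive.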
Matteo Mario Carlà
Emanuele Crincoli
Fiammetta Catania
British Journal of Ophthalmology
University of Messina
Agostino Gemelli University Polyclinic
Fondation Ophtalmologique Adolphe de Rothschild
Carlà et al. studied this question.
www.synapsesocial.com/papers/6969d488940543b9777097bb — DOI: https://doi.org/10.1136/bjo-2025-327634