What question did this study set out to answer?

The study aims to evaluate and compare the diagnostic capacities of leading large language models in interpreting retinal fundus images.

January 16, 2026

Advanced analysis of leading large language models for diagnostic accuracy in retinal imaging

Key Points

The study aims to evaluate and compare the diagnostic capacities of leading large language models in interpreting retinal fundus images.
The authors evaluated eight multimodal large language models (LLMs) on 100 fundus images.
Performance was assessed using validated charts for diagnostic accuracy, specificity, sensitivity, consistency, relevance and explanation quality.
Results were compared across pathology categories.
GPT-4.5 achieved the highest diagnostic accuracy at 65.0% followed by Gemini 2.0 Pro at 63.0%.
Rhegmatogenous pathologies were the most accurately identified category, with Gemini 2.0 Pro reaching 81.3% and GPT-4.5 reaching 75.0%.
Mean accuracy for challenging myopic maculopathy was only 21.8%.
The other six models scored between 46.0% and 52.0% overall and frequently declined to diagnose, with refusal rates ranging from 8.0% (Claude 3.7 Sonnet) to 19.0% (Mixtral 8×22B).

Abstract

Background/Aims: To evaluate and compare the diagnostic capabilities of advanced large language models (LLMs) in interpreting ophthalmological fundus images across diverse pathologies.

Methods: We evaluated eight leading multimodal LLMs (GPT-4.5, Claude 3.7 Sonnet, Grok-2, Deepseek Cognition V2, Qwen2 72B, Gemini 2.0 Pro, Llama 3 405B and Mixtral 8×22B) on their ability to interpret 100 fundus images representing various ophthalmological conditions. Performance was assessed using validated charts for diagnostic accuracy, specificity, sensitivity, consistency, relevance and explanation quality.

Results: GPT-4.5 achieved the highest overall diagnostic accuracy (65.0%), followed by Gemini 2.0 Pro (63.0%). All models showed varied performance across pathology categories, with rhegmatogenous pathologies being most accurately identified (Gemini 2.0 Pro: 81.3%, GPT-4.5: 75.0%) and myopic maculopathy (mean accuracy 21.8%) being particularly challenging. The remaining models performed significantly worse: Deepseek Cognition V2 (52.0%), Claude 3.7 Sonnet (52.0%), Qwen2 72B (49.0%), Llama 3 405B (48.0%), Grok-2 (47.0%) and Mixtral 8×22B (46.0%). Lower-performing models frequently declined to provide diagnoses, with refusal rates from 8.0% (Claude 3.7 Sonnet) to 19.0% (Mixtral 8×22B).

Conclusion: Current LLMs show promising but limited capabilities in ophthalmological image interpretation. While performance on common conditions like retinal detachments and age-related macular degeneration is moderately good, significant challenges remain with rare conditions, myopic pathologies and complex vascular disorders. The competitive performance between GPT-4.5 and Gemini 2.0 Pro, with each excelling in different pathology categories, suggests that leveraging their complementary strengths might offer improved diagnostic support.
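
For readers unfamiliar with the metrics named in the abstract, the following is a minimal Python sketch of how overall accuracy, refusal rate, and per-category sensitivity and specificity are commonly computed from a list of model answers. It is not the authors' scoring procedure. The function names, the category abbreviations (`RD`, `AMD`, `MM`) and the five-image toy dataset are all hypothetical, and counting a refusal as an incorrect answer is an assumption of this sketch rather than a detail taken from the study.

```python
def overall_accuracy(truth, predicted):
    """Fraction of images whose predicted label matches the reference label.

    Assumption: a refusal (None) never matches, so it counts as incorrect.
    """
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def refusal_rate(predicted):
    """Fraction of images for which the model declined to answer (None)."""
    return sum(p is None for p in predicted) / len(predicted)


def sensitivity_specificity(truth, predicted, target):
    """One-vs-rest sensitivity and specificity for a single category.

    Returns None for a metric whose denominator is zero.
    """
    tp = fn = fp = tn = 0
    for t, p in zip(truth, predicted):
        if t == target and p == target:
            tp += 1  # category present and named by the model
        elif t == target:
            fn += 1  # category present but missed (or refused)
        elif p == target:
            fp += 1  # category absent but named by the model
        else:
            tn += 1  # category absent and not named
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Hypothetical toy data, NOT taken from the study.
truth = ["RD", "RD", "AMD", "MM", "AMD"]
predicted = ["RD", "AMD", "AMD", None, "AMD"]

print(overall_accuracy(truth, predicted))
print(refusal_rate(predicted))
print(sensitivity_specificity(truth, predicted, "RD"))
```

Under this convention, an overall accuracy of 65.0% on 100 images corresponds to 65 correct diagnoses.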
