What question did this study set out to answer?

The study aims to evaluate the diagnostic capability of multimodal large language models in interpreting fundus images and their susceptibility to confirmation bias.

April 4, 2026 · Open Access

Multimodal Large Language Models in Ophthalmology: Diagnostic Accuracy and the Risk of Metadata-Induced Confirmation Bias

Key Points

The study aims to evaluate the diagnostic capability of multimodal large language models in interpreting fundus images and their susceptibility to confirmation bias.
The study was a retrospective diagnostic validation using 265 retinal fundus images from the Brazilian Ophthalmological Dataset (mBRSET).
Three LLMs (GPT-4, GPT-5.2, and Gemini 3) were evaluated under three multimodal input conditions: image-only, metadata-enhanced, and synthetic metadata.
Primary tasks were ICDR severity grading, binary DR referral, and detection of an increased vertical cup-to-disc ratio (VCDR > 0.6); metrics were accuracy, Cohen's kappa, and macro F1-score.
GPT-5.2 achieved the highest ICDR accuracy (72.2%) and the highest DR referral kappa (0.51).
All models performed poorly on the glaucoma task: accuracy was moderate (76.3%), but kappa scores were near zero (<0.13).
Gemini 3 was strongly affected by textual confirmation bias, changing its clinical recommendation in 52.4% of cases when given synthetic metadata.
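
The gap between acceptable accuracy and near-zero kappa on the glaucoma task is easier to see with a small worked example. The sketch below is illustrative only: the labels are hypothetical (not the study's data), and the helper names `accuracy`, `cohen_kappa`, and `macro_f1` are assumptions introduced here rather than the authors' code. It shows how a model that always predicts the majority class can score well on accuracy while showing no agreement beyond chance.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of cases where the prediction matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c).
    expected = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Kappa is undefined here; returning 0.0 is a convention of this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: 76 negatives, 24 positives, and a degenerate model
# that answers "negative" for every image.
y_true = [0] * 76 + [1] * 24
y_pred = [0] * 100

print(f"accuracy={accuracy(y_true, y_pred):.2f}")
print(f"kappa={cohen_kappa(y_true, y_pred):.2f}")
print(f"macro_f1={macro_f1(y_true, y_pred):.2f}")
```

In this toy case accuracy is 0.76 while kappa is 0, because the observed agreement equals the agreement expected by chance. This is why kappa and macro F1 are more informative than accuracy when classes are imbalanced.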

Abstract

• GPT-5.2 demonstrated superior and more robust performance in binarized diabetic retinopathy referral tasks compared to GPT-4 and Gemini 3.
• Gemini 3 exhibited significant vulnerability to "textual confirmation bias," leading to changes in clinical recommendations in over 50% of cases when presented with conflicting metadata.
• All multimodal LLMs exhibited degenerate predictive behavior on glaucoma tasks, with near-zero kappa scores indicating a lack of true discriminative ability for spatial nerve assessment.
• GPT-4 and GPT-5.2 displayed "vision-dominant" stability, maintaining initial visual interpretations even when faced with contradictory clinical narratives.
• Native multimodal integration enables direct pixel-level processing, yet LLMs currently serve better as integrative clinical assistants than as standalone diagnostic tools.

To evaluate the diagnostic capability of state-of-the-art multimodal Large Language Models (LLMs) in autonomous interpretation of raw image input (fundus photographs) and to assess their susceptibility to "textual confirmation bias" when integrated with real or synthetic clinical metadata.

Retrospective diagnostic validation study.

A cohort of 265 retinal fundus images was sampled from the Brazilian Ophthalmological Dataset (mBRSET), representing a diverse distribution of diabetic retinopathy stages and suspected glaucoma cases.

Three LLMs (GPT-4, GPT-5.2, and Gemini 3) were evaluated via native multimodal integration. Each model was tested under three multimodal input conditions: (1) Image-Only Multimodal Input, (2) Metadata-Enhanced Multimodal Input, and (3) Synthetic Metadata Multimodal Input. Primary tasks included ICDR severity grading, binary DR referral, and detection of an increased Vertical Cup-to-Disc Ratio (VCDR > 0.6). Metrics included accuracy, Cohen’s kappa, and macro F1-score.

GPT-5.2 emerged as the top performer for DR tasks, achieving the highest ICDR accuracy (72.2%) and DR referral kappa (0.51). In contrast, all models demonstrated degenerate prediction behavior for the glaucoma referral proxy, with moderate accuracy (76.3%) but near-zero kappa scores (<0.13), indicating performance no better than chance. GPT-4 and GPT-5.2 exhibited "vision-dominant" stability, remaining largely unaffected by conflicting metadata. However, Gemini 3 exhibited high volatility (textual confirmation bias), changing clinical recommendations in 52.4% of cases when given synthetic metadata.

Under this experimental setup, current multimodal LLMs can perform structured ophthalmic tasks but remain unreliable for autonomous clinical screening. While the OpenAI models showed relative robustness to conflicting context, Gemini 3 was highly susceptible to metadata-induced hallucinations in this framework. These findings highlight that LLMs are better suited as integrative assistants for documentation and research annotation than as standalone diagnostic tools, emphasizing the need for vision-dominant architectures in medical AI.
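
One way to quantify the volatility described above is a recommendation change rate: the share of cases in which a model's recommendation under synthetic metadata differs from its image-only recommendation. The abstract does not spell out how a "changed recommendation" was defined, so the sketch below is an assumption. The function name `recommendation_change_rate`, the label strings, and the example data are all hypothetical.

```python
def recommendation_change_rate(baseline, perturbed):
    """Share of cases whose recommendation differs between two conditions.

    baseline:  recommendations from the image-only condition
    perturbed: recommendations for the same cases with synthetic metadata
    """
    if len(baseline) != len(perturbed):
        raise ValueError("both conditions must cover the same cases")
    if not baseline:
        raise ValueError("at least one case is required")
    changed = sum(b != p for b, p in zip(baseline, perturbed))
    return changed / len(baseline)


# Hypothetical recommendations for eight images (not the study's data).
image_only = ["refer", "refer", "no_refer", "no_refer",
              "refer", "no_refer", "refer", "no_refer"]
with_synthetic = ["no_refer", "refer", "refer", "no_refer",
                  "no_refer", "no_refer", "refer", "refer"]

rate = recommendation_change_rate(image_only, with_synthetic)
print(f"change rate: {rate:.1%}")
```

A rate near zero would correspond to the "vision-dominant" stability reported for GPT-4 and GPT-5.2, while a rate around one half would resemble the behavior reported for Gemini 3.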
