• GPT-5.2 demonstrated superior and more robust performance in binarized diabetic retinopathy referral tasks compared to GPT-4 and Gemini 3.
• Gemini 3 exhibited significant vulnerability to "textual confirmation bias," leading to changes in clinical recommendations in over 50% of cases when presented with conflicting metadata.
• All multimodal LLMs exhibited degenerate predictive behavior on glaucoma tasks, with near-zero kappa scores indicating a lack of true discriminative ability for spatial nerve assessment.
• GPT-4 and GPT-5.2 displayed "vision-dominant" stability, maintaining initial visual interpretations even when faced with contradictory clinical narratives.
• Native multimodal integration enables direct pixel-level processing, yet LLMs currently serve better as integrative clinical assistants than as standalone diagnostic tools.

To evaluate the diagnostic capability of state-of-the-art multimodal large language models (LLMs) in autonomous interpretation of raw fundus photographs and to assess their susceptibility to "textual confirmation bias" when the images are paired with real or synthetic clinical metadata.

Retrospective diagnostic validation study.

A cohort of 265 retinal fundus images was sampled from the Brazilian Ophthalmological Dataset (mBRSET), representing a diverse distribution of diabetic retinopathy (DR) stages and suspected glaucoma cases. Three LLMs (GPT-4, GPT-5.2, and Gemini 3) were evaluated via native multimodal integration. Each model was tested under three multimodal input conditions: (1) Image-Only Multimodal Input, (2) Metadata-Enhanced Multimodal Input, and (3) Synthetic Metadata Multimodal Input. Primary tasks included ICDR severity grading, binary DR referral, and detection of an increased vertical cup-to-disc ratio (VCDR > 0.6). Metrics included accuracy, Cohen's kappa, and macro F1-score.

GPT-5.2 emerged as the top performer for DR tasks, achieving the highest ICDR accuracy (72.2%) and DR referral kappa (0.51). In contrast, all models demonstrated degenerate prediction behavior on the glaucoma referral proxy, with moderate accuracy (76.3%) but near-zero kappa scores (<0.13), indicating agreement little better than chance. GPT-4 and GPT-5.2 exhibited "vision-dominant" stability, remaining largely unaffected by conflicting metadata. However, Gemini 3 exhibited high volatility (textual confirmation bias), changing clinical recommendations in 52.4% of cases when given synthetic metadata.

Under this experimental setup, current multimodal LLMs can perform structured ophthalmic tasks but remain unreliable for autonomous clinical screening. While the OpenAI models showed relative robustness to conflicting context, Gemini 3 was highly susceptible to metadata-induced hallucinations in this framework. These findings indicate that LLMs are better suited as integrative assistants for documentation and research annotation than as standalone diagnostic tools, emphasizing the need for vision-dominant architectures in medical AI.
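The gap between moderate accuracy and near-zero kappa reported for the glaucoma proxy can be illustrated with a minimal sketch. The labels below are synthetic and purely illustrative (they are not the study's data), and the helper names `accuracy` and `cohen_kappa` are hypothetical. The sketch assumes a classifier that always predicts the majority class on an imbalanced binary task.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of cases where prediction and reference label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    p_expected = sum((true_counts[l] / n) * (pred_counts[l] / n) for l in labels)
    if p_expected == 1.0:
        return 0.0  # both raters use a single identical label; kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


# Synthetic, illustrative labels (NOT the study's data): 76 negatives, 24 positives.
y_true = [0] * 76 + [1] * 24
# A degenerate model that never flags a case as positive.
y_pred = [0] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"kappa    = {cohen_kappa(y_true, y_pred):.2f}")
```

In this toy setting the accuracy simply equals the prevalence of the majority class, while kappa is zero because the observed agreement matches what the marginal frequencies alone would produce. This is why a chance-corrected metric is reported alongside accuracy.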
Cindy Lie Tabuse
David Restrepo
Carolina P. B. Gracitelli
Tabuse et al. studied this question.
www.synapsesocial.com/papers/69d0af36659487ece0fa51e5 — DOI: https://doi.org/10.1016/j.ajoint.2026.100253