Abstract
Multi-modal large language models (MLLMs) show growing potential in medical applications, particularly in image-intensive fields such as ophthalmology. Although state-of-the-art models such as ChatGPT-4o and Qwen-VL 2.5 perform impressively on general-domain tasks, real-world clinical benchmark datasets for rigorously evaluating their diagnostic capabilities in specialized medical contexts are still lacking. To address this gap, we constructed a curated benchmark dataset of 295 pathologically confirmed ophthalmic cases with representative clinical presentations. Using this dataset, we systematically evaluated nine leading MLLMs, both open-source and proprietary. Our results show that models such as HAIBU-REMUD, ChatGPT-4o, and Gemini 2.5 achieve high diagnostic accuracy and strong consistency, with performance approaching that of human experts. These findings suggest that current MLLMs have reached a promising stage of applicability to real-world clinical settings, laying the groundwork for their integration into ophthalmology practice.
Weihua Yang
Shoujun Huang
Junhong Chen
Southern Medical University
Zhejiang Normal University
Eye Center
Yang et al. studied this question.
www.synapsesocial.com/papers/689a0621e6551bb0af8cdc1f — DOI: https://doi.org/10.21203/rs.3.rs-7186903/v1