July 23, 2025

Benchmark Evaluation of Multi-Modal Large Language Models for Ophthalmic Diagnosis

Key Points

Nine leading multi-modal large language models were systematically evaluated for ophthalmic diagnosis.
The benchmark dataset comprises 295 pathologically confirmed ophthalmic cases with representative clinical presentations.
Models like ChatGPT-4o and HAIBU-REMUD showed diagnostic accuracy nearly matching human experts.
These findings highlight the potential of integrating multi-modal large language models into ophthalmology practice.

Abstract

Multi-modal large language models (MLLMs) are increasingly demonstrating significant potential in medical applications, particularly in image-intensive fields such as ophthalmology. While state-of-the-art models like ChatGPT-4o and Qwen-VL 2.5 exhibit impressive performance in general-domain tasks, there remains a lack of real-world clinical benchmark datasets to rigorously evaluate their diagnostic capabilities in specialized medical contexts. To address this gap, we constructed a curated benchmark dataset comprising 295 pathologically confirmed ophthalmic cases with representative clinical presentations. Using this dataset, we conducted a systematic evaluation of nine leading MLLMs, both open-source and proprietary. Our results reveal that models such as HAIBU-REMUD, ChatGPT-4o and Gemini 2.5 achieve high diagnostic accuracy and strong consistency, with performance approaching that of human experts. These findings suggest that current MLLMs have reached a promising stage in terms of applicability to real-world clinical settings, laying the groundwork for their integration into ophthalmology practice.
