What question did this study set out to answer?

This research aims to assess the accuracy and reliability of multimodal large language models in image-based medical education.

May 15, 2026 · Open Access

Performance of multimodal large language models on image‐based surgical anatomy, anatomical pathology, and radiology questions

Key Points

Evaluated four multimodal LLMs on 208 image-based questions from a Doctor of Medicine program.
Questions covered anatomical pathology, radiology, and surgical anatomy, presented in image-only format.
Analysis included recognition-only and recognition-plus-reasoning item formats.
ChatGPT-5.1 achieved the highest accuracy at 75.5% (95% CI [69.2-80.8]).
Significant performance variation was noted across models (p < 0.001; Cramér's V = 0.45).
All models showed higher accuracy for recognition-only and selected-response items.

Abstract

Multimodal large language models (LLMs) are now deeply integrated into medical education and widely used by medical students, yet it remains unclear whether current models possess the accuracy and reliability needed to support image-based learning. We evaluated four state-of-the-art multimodal LLMs (ChatGPT-5.1, Gemini-2.5, Grok-4, Claude Sonnet-4.5) on 208 image-based examination questions from a Doctor of Medicine program, spanning anatomical pathology (histopathology; 47.6%), radiology (31.7%), and surgical anatomy (20.7%). To isolate visual reasoning, all items were presented in image-only form with contextual information removed. Items covered seven organ systems, included both constructed-response and selected-response formats, and were categorized as recognition-only or recognition-plus-reasoning. ChatGPT-5.1 achieved the highest accuracy (75.5%; 95% CI 69.2-80.8), followed by Gemini-2.5 (59.6%; 95% CI 52.8-66.1), Claude Sonnet-4.5 (41.8%; 95% CI 35.3-48.6), and Grok-4 (34.6%; 95% CI 28.5-41.3). Overall model performance differed significantly (p < 0.001; Cramér's V = 0.45), with a consistent ranking (ChatGPT > Gemini > Claude ≈ Grok) across different categories. For every model, accuracy was higher on recognition-only and selected-response items than on recognition-plus-reasoning and constructed-response items. Even the best-performing model, ChatGPT-5.1, answered approximately one in four questions incorrectly. These findings suggest that current multimodal LLMs cannot yet replace expert teaching in image-based learning. Their use in medical education should therefore remain supervised and critically appraised, serving as adjuncts rather than authoritative sources.
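
As a rough check on how accuracy figures like these relate to their confidence intervals, the sketch below computes a Wilson score interval for each model. Two things here are assumptions, not statements from the abstract: the interval method (the abstract does not say which one the authors used) and the correct-answer counts, which are back-calculated from the reported percentages of 208 items. The function name `wilson_interval` and the dictionary `ASSUMED_CORRECT` are illustrative only.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


N_ITEMS = 208  # number of questions, as stated in the abstract

# ASSUMPTION: counts are inferred from the reported percentages
# (e.g. 157 / 208 = 75.5%); the abstract does not list raw counts.
ASSUMED_CORRECT = {
    "ChatGPT-5.1": 157,
    "Gemini-2.5": 124,
    "Claude Sonnet-4.5": 87,
    "Grok-4": 72,
}

for model, k in ASSUMED_CORRECT.items():
    lo, hi = wilson_interval(k, N_ITEMS)
    print(f"{model}: {100 * k / N_ITEMS:.1f}% (95% CI {100 * lo:.1f}-{100 * hi:.1f})")
```

Under these assumptions the ChatGPT-5.1 interval comes out at roughly 69.2 to 80.8, in line with the reported figure. That agreement is suggestive, not proof that the authors used this method.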
