Multimodal large language models (LLMs) are now deeply integrated into medical education and widely used by medical students, yet it remains unclear whether current models possess the accuracy and reliability needed to support image-based learning. We evaluated four state-of-the-art multimodal LLMs (ChatGPT-5.1, Gemini-2.5, Grok-4, Claude Sonnet-4.5) on 208 image-based examination questions from a Doctor of Medicine program, spanning anatomical pathology (histopathology; 47.6%), radiology (31.7%), and surgical anatomy (20.7%). To isolate visual reasoning, all items were presented in image-only form with contextual information removed. Items covered seven organ systems, included both constructed-response and selected-response formats, and were categorized as recognition-only or recognition-plus-reasoning. ChatGPT-5.1 achieved the highest accuracy (75.5%; 95% CI 69.2-80.8), followed by Gemini-2.5 (59.6%; 95% CI 52.8-66.1), Claude Sonnet-4.5 (41.8%; 95% CI 35.3-48.6), and Grok-4 (34.6%; 95% CI 28.5-41.3). Overall model performance differed significantly (p Gemini > Claude ≈ Grok) across different categories. Accuracy was uniformly higher for recognition-only and selected-response items. Even the best-performing model, ChatGPT-5.1, answered approximately one in four questions incorrectly. These findings suggest that current multimodal LLMs cannot yet replace expert teaching in image-based learning. Their use in medical education should therefore remain supervised and critically appraised, serving as adjuncts rather than authoritative sources.
Lu et al. (Wed,) studied this question.