Abstract

Background and Aim: Visual data from images is essential for many medical diagnoses. This study evaluates how well multimodal Large Language Models (LLMs) integrate textual and visual information for diagnostic purposes.

Methods: We tested GPT-4o and Claude Sonnet 3.5 on 120 clinical vignettes, with and without accompanying images. Each vignette included patient demographics, a chief complaint, and relevant medical history. Vignettes were paired with either clinical or radiological images from two sources: 100 images from the OPENi database and 20 images from recent NEJM challenges, ensuring they were not in the LLMs’ training sets. Three primary care physicians served as a human benchmark. We analyzed diagnostic accuracy and, for a subset of cases, the models’ explanations.

Results: LLMs outperformed physicians in text-only scenarios (GPT-4o: 70.8%, Claude Sonnet 3.5: 59.5%, Physicians: 39.5%). With images, all three groups improved, and physicians showed the largest gain (GPT-4o: 84.5%, p<0.001; Claude Sonnet 3.5: 67.3%, p=0.060; Physicians: 78.8%, p<0.001). LLMs changed their explanations in 45-60% of cases when presented with images, demonstrating some level of visual data integration.

Conclusion: Multimodal LLMs show promise in medical diagnosis, with improved performance when integrating visual evidence. However, this improvement is inconsistent and smaller than that seen in physicians, indicating a need for enhanced visual data processing in these models.
Agbareia et al. studied this question.