Abstract

Background: This study systematically compared the diagnostic accuracy of five contemporary multimodal large language models (MLLMs: Gemini-2.5-Pro, Grok-4, GPT-4o, GPT-5, and Qwen-VL-Max) in assigning the Mayo Endoscopic Score (MES) for ulcerative colitis (UC), and explored their consistency and performance across intestinal segments and MES categories.

Methods: A total of 402 authentic endoscopic images from patients with UC were collected, covering the entire colon from the ileocecal region to the rectum. Three experienced inflammatory bowel disease (IBD) experts independently graded the images; the 283 images on which all three agreed entered the second stage of the study, with the consensus grades serving as the reference standard. These images were presented in random order to the MLLMs and to two senior IBD physicians without specifying the intestinal segment, and then presented again in random order to the MLLMs with the segment stated before grading. Model and physician performance were compared, and stratified analyses were conducted by intestinal segment and MES grade.

Results: The diagnostic accuracies (Acc) of the two IBD physicians were 81.6% and 78.4%, respectively, with strong inter-observer agreement (κ = 0.692). Among the MLLMs, GPT-5 achieved the highest overall performance (F1: GPT-5 0.720, GPT-4o 0.602, Gemini-2.5-Pro 0.480, Grok-4 0.415, Qwen-VL-Max 0.338), and its diagnostic accuracy was comparable to that of human physicians (GPT-5 Acc 71.7% vs. Senior Physician 2 Acc 78.4%, P = 0.068). The other models performed significantly worse than the experienced IBD physicians (all P < 0.001). The sigmoid colon was the most accurately assessed region (mean F1 across models 0.682), whereas the rectum and ileocecal region remained the most challenging (0.447 and 0.493, respectively). Providing segmental information significantly improved the performance of the lower-performing models.
Moreover, both the models and the physicians showed their lowest accuracy at MES = 1 (physicians' mean Acc ≈ 60.3%, models' mean Acc ≈ 39.4%), indicating that the mild-activity grade remains the most challenging to classify due to its inherent subjectivity.

Conclusion: GPT-5 demonstrated diagnostic performance comparable to that of senior IBD physicians in MES grading, whereas the other MLLMs require further optimization. The rectum and ileocecal region among intestinal segments, and mild activity (MES = 1) among severity grades, were common challenges for both physicians and models. Future efforts should focus on targeted training for these challenging segments and grades and on integrating additional clinical multimodal data to advance the clinical implementation of intelligent endoscopic assessment.

Conflict of interest: Zhao, Xiaoyi: No conflict of interest; Qiang, Zhan: No conflict of interest; Jing, Sun: No conflict of interest.
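For readers unfamiliar with the metrics reported in the Results (Acc, F1, κ), the following is a minimal Python sketch of how such values can be computed from MES grades. It is an illustration under stated assumptions, not the authors' analysis code: the abstract does not say how F1 was averaged or whether κ was weighted, so macro-averaged F1 and unweighted Cohen's κ are assumed here. The grade lists are invented toy data, and all function and variable names are hypothetical.

```python
from collections import Counter

MES_GRADES = (0, 1, 2, 3)  # Mayo Endoscopic Score categories


def accuracy(y_true, y_pred):
    """Fraction of images whose predicted grade equals the reference grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=MES_GRADES):
    """Unweighted mean of per-grade F1 scores (macro averaging is an assumption)."""
    scores = []
    for grade in labels:
        tp = sum(t == grade and p == grade for t, p in zip(y_true, y_pred))
        fp = sum(t != grade and p == grade for t, p in zip(y_true, y_pred))
        fn = sum(t == grade and p != grade for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A grade that never appears in either list contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two raters (weighting is an assumption)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of each rater's marginal frequencies.
    expected = sum(counts_a[g] * counts_b[g] for g in set(rater_a) | set(rater_b)) / n**2
    return (observed - expected) / (1 - expected)


# Toy data only: eight hypothetical images, not values from the study.
reference = [0, 0, 1, 1, 2, 2, 3, 3]    # consensus grades (reference standard)
physician_1 = [0, 0, 1, 2, 2, 2, 3, 3]  # hypothetical reader
physician_2 = [0, 1, 1, 1, 2, 2, 3, 3]  # hypothetical second reader

print("Acc:", accuracy(reference, physician_1))
print("Macro F1:", round(macro_f1(reference, physician_1), 3))
print("Kappa:", round(cohen_kappa(physician_1, physician_2), 3))
```

Macro averaging gives each MES grade equal weight regardless of how many images fall in it, so a weak grade such as MES = 1 lowers the overall score noticeably. A weighted κ would instead penalize disagreements by how many grades apart they are.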
Zhao et al. (Thu,) studied this question.