What question did this study set out to answer?

To assess the diagnostic accuracy and ordinal concordance of the Gemini 2.5-Pro model for classifying the Mayo Endoscopic Subscore in ulcerative colitis without prior training.

January 23, 2026

P0280 Zero-Shot classification of the Mayo Endoscopic Subscore in Ulcerative Colitis: Diagnostic accuracy and ordinal concordance of Gemini 2.5-Pro

Key Points

To assess the diagnostic accuracy and ordinal concordance of the Gemini 2.5-Pro model for classifying the Mayo Endoscopic Subscore in ulcerative colitis without prior training.
Utilized a public expert-labelled dataset from the LIMUC repository
Employed a zero-shot approach with Gemini 2.5-Pro for classification
Assessed accuracy and ordinal performance using global metrics and error distance analysis
Calculated confidence intervals using Wilson-score and BCa bootstrap methods
Overall classification accuracy was 72.5% (95% CI: 65.9%-78.2%)
Achieved a Quadratic Weighted Kappa of 0.875, indicating near-perfect agreement
Mean Absolute Error was 0.29 (95% BCa CI: 0.22–0.36)
For the binary classification task (inactive/mild vs. moderate/severe), accuracy was 88.5% (95% CI: 83.3%–92.2%)
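
The Wilson-score intervals listed above can be checked with a short calculation. The following is a minimal sketch, not the authors' code: the function name `wilson_interval` is hypothetical, and the binary-task count of 177 correct out of 200 is inferred from the reported 88.5% accuracy rather than stated in the abstract.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    # Centre of the interval is pulled from p_hat towards 0.5.
    centre = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# 4-class task: 145 of 200 predictions correct (reported in the abstract).
four_class_ci = wilson_interval(145, 200)

# Binary task: 177 of 200 correct is an assumption derived from 0.885 * 200.
binary_ci = wilson_interval(177, 200)

print([round(x, 3) for x in four_class_ci])  # [0.659, 0.782]
print([round(x, 3) for x in binary_ci])      # [0.833, 0.922]
```

Both intervals agree with the reported 65.9%-78.2% and 0.833–0.922 to three decimal places.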

Abstract

Background: Inter-observer variability in the Mayo Endoscopic Subscore (MES) for ulcerative colitis (UC) is a critical barrier (kappa ≈ 0.58) (4). While dedicated deep learning (CNN) models are accurate (10), their reliance on massive datasets for fine-tuning (11) creates a significant bottleneck. It remains unknown whether general-purpose generative multimodal models can accurately classify the MES in a zero-shot (no training) setting. This proof-of-concept study quantified the diagnostic accuracy and ordinal concordance of a general-purpose multimodal model (Gemini 2.5-Pro) in a strict zero-shot configuration for MES classification, using a public, expert-labelled dataset (12) as the ground truth.

Methods: This STARD (13) and TRIPOD+AI (14) compliant study used a stratified, random, and balanced sample (N = 200; n = 50 per class, MES 0-3) from the public LIMUC repository (12). The index test was the Gemini 2.5-Pro model, provided with an image and a text prompt containing the official MES criteria, with no prior fine-tuning. The primary outcome was global accuracy. Key secondary outcomes included per-class metrics, Mean Absolute Error (MAE), Quadratic Weighted Kappa (QWK) (19) for ordinal concordance, and performance on the clinically critical binary task (inactive/mild MES 0-1 vs. moderate/severe MES 2-3). The 95% confidence intervals (CIs) were calculated using the Wilson-score method (20, 21) or the BCa bootstrap method (2,000 replicates) (22, 23).

Results: The model's overall 4-class accuracy was 72.5% (145/200; 95% CI: 65.9%-78.2%). Ordinal performance was exceptionally strong, achieving a QWK of 0.875 (95% BCa CI: 0.833–0.909), indicating "Near Perfect" agreement (19). The low MAE of 0.29 (95% BCa CI: 0.22–0.36) corroborated this. Error distance analysis showed that 98.5% (197/200) of all predictions were either correct (distance 0) or an adjacent-class error (distance 1). Only 1.5% (n = 3) of predictions were distance 2 errors, and no critical distance 3 errors occurred.
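
As a consistency check on the figures above, the reported MAE can be recovered from the error-distance counts alone. The count of 52 adjacent-class errors is not stated directly; it is inferred here as 197 − 145. With the MES coded as the integers 0 to 3, the absolute error of each prediction equals its error distance, so:

```latex
\mathrm{MAE}
  = \frac{1}{N}\sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert
  = \frac{145 \cdot 0 + 52 \cdot 1 + 3 \cdot 2 + 0 \cdot 3}{200}
  = \frac{58}{200}
  = 0.29
```

This matches the reported MAE of 0.29.
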
For the clinically relevant binary task (MES 0-1 vs. 2-3), the model achieved an accuracy of 0.885 (95% CI: 0.833–0.922), with "Substantial" agreement (Kappa 0.770). Performance was highest at the extremes, with 0.920 sensitivity for MES 0 and 0.973 specificity for MES 3.

Conclusion: This study provides the first evidence that a zero-shot generative multimodal model can classify the MES with "Near Perfect" ordinal concordance, bypassing the fine-tuning bottleneck. This level of ordinal performance suggests that foundation models possess a robust internal visual understanding of mucosal inflammation. This approach offers a viable, scalable pathway to mitigate human inter-observer variability (4) and standardise objective endoscopic assessment in clinical trials and routine practice.

References:
1. Turner D, Ricciuto A, Lewis A, D'Amico F, Dhaliwal J, Griffiths AM, et al. STRIDE-II: An Update on the Selecting Therapeutic Targets in Inflammatory Bowel Disease Initiative of the IOIBD: Determining Therapeutic Goals for Treat-to-Target Strategies in IBD. Gastroenterology. 2021;160(5):1570-1583. doi:10.1053/j.gastro.2020.12.031.
2. Buchner AM, Farraye FA, Iacucci M. AGA Clinical Practice Update on Endoscopic Scoring Systems in Inflammatory Bowel Disease: Commentary. Clin Gastroenterol Hepatol. 2024;22(11):2188-2196. doi:10.1016/j.cgh.2024.06.048.
3. Di Ruscio M, Cedola M, Mangone M, Brighi S. How to assess endoscopic disease activity in ulcerative colitis in 2022. Ann Gastroenterol. 2022;35(5):462-470. doi:10.20524/aog.2022.0732.
4. Hashash JG, Jaoude JB, Kothari MM, Shao Y, Binion DG, Nanda KS, et al. Inter- and Intra-observer Variability in the Assessment of Inflammation in Inflammatory Bowel Disease: A Systematic Review and Meta-analysis. Inflamm Bowel Dis. 2024;30(9):1590-1607. doi:10.1093/ibd/izae093.
5. Raine T, Bonovas S, Burisch J, Kucharzik T, Adamina M, Annese V, et al. ECCO Guidelines on Therapeutics in Ulcerative Colitis: Medical Treatment. J Crohns Colitis. 2022;16(1):2-17. doi:10.1093/ecco-jcc/jjab178.
6. Dekker E, Nass KJ, Iacucci M, Murino A, Sabino J, Bugajski M, et al. Performance measures for colonoscopy in inflammatory bowel disease patients: European Society of Gastrointestinal Endoscopy (ESGE) Quality Improvement Initiative. Endoscopy. 2022;54(9):904-915. doi:10.1055/a-1874-0946.
7. Gordon H, Jess T, Kirchgesner J, Rubin DT, et al. ECCO Guidelines on Inflammatory Bowel Disease and Malignancies. J Crohns Colitis. 2023;17(6):827-854. doi:10.1093/ecco-jcc/jjac187.
8. Murthy SK, Feuerstein JD, Nguyen GC, Velayos FS. AGA Clinical Practice Update: Endoscopic surveillance and management of colorectal dysplasia in IBD. Gastroenterology. 2021;161(3):1043-1051. doi:10.1053/j.gastro.2021.05.066.
9. Kim JE, Choi YH, Lee YC, Seong G, Song JH, Kim TJ, et al. Deep learning model for distinguishing Mayo endoscopic subscore 0 and 1 in patients with ulcerative colitis. Sci Rep. 2023;13:11351. doi:10.1038/s41598-023-38206-6.
10. Stidham RW, Liu W, Bishu S, Rice MD, Higgins PDR, Zhu J, et al. Performance of a Deep Learning Model vs Human Reviewers in Grading Endoscopic Disease Severity of Patients With Ulcerative Colitis. JAMA Netw Open. 2019;2(5):e193963. doi:10.1001/jamanetworkopen.2019.3963.
11. Da Rio L, Germani U, Savarino E, et al. Artificial intelligence and inflammatory bowel disease. World J Gastrointest Endosc. 2023;15(1):1-19. doi:10.4253/wjge.v15.i1.1.
12. Polat G, Kani HT, Ergenc I, Alahdab YO, Temizel A, Atug O. Labeled Images for Ulcerative Colitis (LIMUC) Dataset. Zenodo; 2022. doi:10.5281/zenodo.5827695.
13. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, et al. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527. doi:10.1136/bmj.h5527.
14. Collins GS, Dhiman P, Andaur Navarro CL, Ma J, Hooft L, Riley RD, et al. TRIPOD+AI: Updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. doi:10.1136/bmj-2023-078378.
15. Mongan J, Moy L, Kahn CE Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers. Radiol Artif Intell. 2020;2(2):e200029. doi:10.1148/ryai.2020200029.
16. Liu X, Rivera SC, Moher D, Calvert MJ, Denniston AK; SPIRIT-AI and CONSORT-AI Working Group. CONSORT-AI extension for clinical trials of AI interventions. Nat Med. 2020;26:1364-1374. doi:10.1038/s41591-020-1034-x.
17. OpenAI. GPT-4o System Card. arXiv preprint. 2024 Oct 25 [cited 2025 Nov 14]; arXiv:2410.21276. Available from: https://arxiv.org/abs/2410.21276.
18. OpenAI. Model [Internet]. OpenAI; c2024 [cited 2025 Nov 14]. Available from: https://platform.openai.com/docs/models/gpt-4o.
19. Cohen J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull. 1968;70(4):213-220. doi:10.1037/h0026256.
20. Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22(158):209-212. doi:10.1080/01621459.1927.10502953.
21. Brown LD, Cai TT, DasGupta A. Interval estimation for a binomial proportion. Stat Sci. 2001;16(2):101-133. doi:10.1214/ss/1009213286.
22. Efron B. Bootstrap methods: Another look at the jackknife. Ann Stat. 1979;7(1):1-26. doi:10.1214/aos/1176344552.
23. Efron B. Better bootstrap confidence intervals. J Am Stat Assoc. 1987;82(397):171-185. doi:10.1080/01621459.1987.10478410.
24. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825-2830.
25. Python Software Foundation. What's New In Python 3.11 [Internet]. Wilmington (DE): Python Software Foundation; 2022 [cited 2025 Nov 14]. Available from: https://docs.python.org/3/whatsnew/3.11.html.

Conflict of interest:
Prof. Dr. Da Silva Cornelio, Thiago: No conflict of interest
Rodrigues de Freitas Junior, Wilson: No conflict of interest
