Abstract Background Optical diagnosis allows for rapid endoscopic decision-making, but many practitioners are inadequately trained. Large language models, such as Anthropic’s Claude Opus 4, have not been evaluated in their ability to apply the NICE or JNET classification systems to colorectal polyps to more reliably predict lesion histologies. Additionally, there is a limited reference base for ideal prompting strategies in gastrointestinal endoscopy. Aims The diagnostic accuracy of multimodal large language models (MLLMs) in classifying colorectal polyps and predicting histology will be comparable to societal (ASGE and ESGE) guidelines and to that of traditional computer-aided diagnostic systems. Methods We conducted a retrospective diagnostic accuracy study using the PRIME dataset, a curated set of white light and narrow-band imaging (NBI) images. We evaluated Claude Opus 4, Google Gemini 2.5 Pro, GPT-o3, GPT-4o, and GPT-5. For Paris, Narrow-band Imaging Colorectal Endoscopic (NICE), and classifications and predicted histology, we calculated percent correct scores and accuracy of each MLLM compared to expert responses for 132 cases. For the Japanese NBI Expert Team (JNET) classification, we analyzed 82 cases. We calculated accuracy, sensitivity, specificity, and positive and negative predictive values. Cochran’s Q and McNemar’s Test were used to determine differences between the predicted values of each MLLM. Results Claude Opus 4 and GPT-5 had significantly higher percent correct scores than other MLLMs at 41.7%, using Paris classification. The highest percent correct score range among all the classifications was NICE, which had a range of 78.0% to 84.1%. The accuracy scores among MLLMs were 80% for all models for neoplastic vs. non-neoplastic polyps. Conclusions Claude Opus 4 and Gemini 2.5 Pro showed the highest accuracy in differentiating colorectal polyp subtypes, performing closest to expert consensus. Sensitivity and specificity, however, did not meet ESGE standards, highlighting the need for prospective multicenter trials and the design of human-in-the-loop workflows before clinical deployment. Funding Agencies Pialis Family Chair in Education
Vences et al. (Sun,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: