What question did this study set out to answer?

This study aims to evaluate the diagnostic accuracy of multimodal large language models in classifying colorectal polyps and predicting histology.

February 16, 2026Open Access

Poster Session I - A55 PERFORMANCE OF LARGE LANGUAGE MODELS IN THE OPTICAL DIAGNOSIS OF COLORECTAL POLYPS

Key Points

This study aims to evaluate the diagnostic accuracy of multimodal large language models in classifying colorectal polyps and predicting histology.
Conducted a retrospective diagnostic accuracy study using the PRIME dataset with white light and NBI images.
Evaluated MLLMs including Claude Opus 4, Google Gemini 2.5 Pro, GPT-o3, GPT-4o, and GPT-5 for classification performance.
Calculated percent correct scores and accuracy for 132 cases using NICE and JNET classifications.
Analyzed additional 82 cases with JNET classification to compare MLLM predictions to expert responses.
Claude Opus 4 and GPT-5 achieved significantly higher accuracy at 41.7% for Paris classification.
NICE classification yielded the highest percent correct scores ranging between 78.0% to 84.1%.
All MLLMs had accuracy scores exceeding 80% for neoplastic versus non-neoplastic polyps.
Sensitivity and specificity did not meet ESGE standards, indicating further research is required.

Abstract

Abstract Background Optical diagnosis allows for rapid endoscopic decision-making, but many practitioners are inadequately trained. Large language models, such as Anthropic’s Claude Opus 4, have not been evaluated in their ability to apply the NICE or JNET classification systems to colorectal polyps to more reliably predict lesion histologies. Additionally, there is a limited reference base for ideal prompting strategies in gastrointestinal endoscopy. Aims The diagnostic accuracy of multimodal large language models (MLLMs) in classifying colorectal polyps and predicting histology will be comparable to societal (ASGE and ESGE) guidelines and to that of traditional computer-aided diagnostic systems. Methods We conducted a retrospective diagnostic accuracy study using the PRIME dataset, a curated set of white light and narrow-band imaging (NBI) images. We evaluated Claude Opus 4, Google Gemini 2.5 Pro, GPT-o3, GPT-4o, and GPT-5. For Paris, Narrow-band Imaging Colorectal Endoscopic (NICE), and classifications and predicted histology, we calculated percent correct scores and accuracy of each MLLM compared to expert responses for 132 cases. For the Japanese NBI Expert Team (JNET) classification, we analyzed 82 cases. We calculated accuracy, sensitivity, specificity, and positive and negative predictive values. Cochran’s Q and McNemar’s Test were used to determine differences between the predicted values of each MLLM. Results Claude Opus 4 and GPT-5 had significantly higher percent correct scores than other MLLMs at 41.7%, using Paris classification. The highest percent correct score range among all the classifications was NICE, which had a range of 78.0% to 84.1%. The accuracy scores among MLLMs were 80% for all models for neoplastic vs. non-neoplastic polyps. Conclusions Claude Opus 4 and Gemini 2.5 Pro showed the highest accuracy in differentiating colorectal polyp subtypes, performing closest to expert consensus. Sensitivity and specificity, however, did not meet ESGE standards, highlighting the need for prospective multicenter trials and the design of human-in-the-loop workflows before clinical deployment. Funding Agencies Pialis Family Chair in Education

Read Full Paperexternally

Mark Helpful

Bookmark

Relay

View Full Paper