November 27, 2025 · Open Access

Vision language models versus machine learning models performance on polyp detection and classification in colonoscopy images

Abstract

Medical image analysis is central to clinical decision-making, and recent advances in vision–language models (VLMs) have introduced promising capabilities for jointly processing visual and textual data. This study evaluates zero-shot VLMs against convolutional neural networks (CNNs) and classical machine learning (CML) models for polyp detection (CADe) and classification (CADx) using 2,258 colonoscopy images from 428 patients with histopathological labels. We benchmarked 15 approaches including ResNet50, five CMLs (random forest, support vector machine, logistic regression, decision tree, Gaussian naive Bayes), two contrastive vision–language encoders (CLIP, BiomedCLIP), and seven frontier VLMs (GPT-4, GPT-4.1, GPT-4.1-mini, Gemma-3-27b, Qwen-2.5-vl-72b, Gemini-1.5-Pro, Claude-3-Opus). For polyp detection, the highest-performing VLMs (GPT-4.1 F1: 91.98%, GPT-4.1-mini F1: 91.16%) matched CNN performance (ResNet50 F1: 91.35%), though substantial variability existed across VLMs (F1 range: 19.37% to 91.98%). For classification, CNNs substantially outperformed VLMs: ResNet50 achieved weighted F1 of 74.94% versus 55.07% for GPT-4.1-mini, with performance gaps widening dramatically for rare polyp subtypes where VLMs often achieved 0% F1. External validation on 75 images showed that while ResNet50 performance declined substantially, some VLMs demonstrated more stable cross-institutional performance. These findings establish a task-dependent performance hierarchy where VLMs match CNNs for detection but remain limited for classification, suggesting distinct clinical roles for each approach.
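
The classification results above are reported as weighted F1, which can mask failures on rare subtypes. The sketch below assumes the conventional support-weighted definition (per-class F1 averaged by the number of true samples in each class); the abstract does not state which implementation the authors used. The class names and labels are hypothetical toy data, not results from the study.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label in y_true."""
    scores = {}
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its true-label count."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[c] * support[c] for c in support) / len(y_true)

# Hypothetical toy labels (assumption): two common classes and one rare class.
y_true = ["common_a"] * 6 + ["common_b"] * 3 + ["rare"]
# The model never predicts "rare" and confuses one "common_b" with "common_a".
y_pred = ["common_a"] * 6 + ["common_b"] * 2 + ["common_a"] + ["common_a"]

scores = per_class_f1(y_true, y_pred)
overall = weighted_f1(y_true, y_pred)
macro = sum(scores.values()) / len(scores)
print(scores, round(overall, 4), round(macro, 4))
```

In this toy case the rare class scores an F1 of 0 while the weighted F1 stays near 0.75, because the rare class contributes only one of ten samples. The unweighted (macro) average drops to about 0.55, which is why per-subtype scores are worth reading alongside the weighted figure.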

Authors

Mohammad Khalafi

Johns Hopkins University

Seyed Amir Ahmad Safavi‐Naini

Icahn School of Medicine at Mount Sinai

Ameneh Salehi

Shahid Beheshti University of Medical Sciences

Journals

SHILAP Revista de lepidopterología

Scientific Reports

Institutions

Columbia University

Icahn School of Medicine at Mount Sinai

Cedars-Sinai Medical Center
