What question did this study set out to answer?

This study examines how robust the zero-shot vision-language model CLIP is under semantic ambiguity and distributional perturbations.

June 17, 2026

A Systematic Study of CLIP Zero-Shot Robustness Under Semantic Ambiguity and Distributional Perturbations

Key Points

Conducted controlled experiments on a five-class dataset using CLIP ViT-B/32 for image classification and retrieval.
Evaluated performance through prompt template ablations and assessed responses to hard semantic distinctions.
Tested robustness under image degradation and in open-set classification with semantically related distractors.
CLIP exhibited high accuracy with standard prompts and clear images, but performance dropped significantly in ambiguous scenarios and with degraded inputs.
Text-to-image retrieval remained more stable under these perturbations, indicating that robustness differs between tasks.
Uncovered significant failure patterns under specific conditions, suggesting strong sensitivity to semantic context.
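
The prompt template ablation and the open-set test with distractors both come down to changing the set of text prompts that an image is compared against. The sketch below shows one way such prompt sets might be built. The templates, the class names, and the helper names `build_prompts` and `with_distractors` are illustrative assumptions; the study's actual templates and five classes are not listed here.

```python
def build_prompts(class_names, template):
    """Fill one prompt template with every class name."""
    return [template.format(name) for name in class_names]


def with_distractors(class_names, distractors):
    """Candidate labels for an open-set test: true classes plus distractors.

    Distractors that duplicate a true class are dropped; order is preserved.
    """
    extra = [d for d in distractors if d not in class_names]
    return list(class_names) + extra


# Hypothetical templates and labels, chosen only for illustration.
templates = ["{}", "a photo of a {}"]
classes = ["cat", "dog"]

# One prompt list per template, so accuracy can be compared across templates.
prompt_sets = {t: build_prompts(classes, t) for t in templates}

# Open-set variant: a semantically related label is added to the candidates.
open_set_labels = with_distractors(classes, ["wolf", "dog"])
open_set_prompts = build_prompts(open_set_labels, "a photo of a {}")
```

Under this setup, a prediction that lands on a distractor label for an image of a true class would count as an open-set error.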

Abstract

Zero-shot vision–language models such as CLIP have demonstrated strong generalization without task-specific training, yet their robustness under semantic ambiguity and distributional perturbations remains insufficiently understood. In this work, we systematically study the behavior of CLIP in a controlled zero-shot image classification and retrieval setting. Using embedding similarity between images and text prompts, we evaluated CLIP ViT-B/32 on a five-class dataset through a series of targeted experiments, including prompt template ablations, hard semantic distinctions (e.g., cat vs. dog vs. wolf), open-set classification with semantically related distractors, and robustness tests under image degradation. Although CLIP achieved near-perfect accuracy with standard prompts and clean inputs, we observed a significant decline in performance in open-set scenarios and under image degradation, revealing non-trivial failure patterns. In contrast, text-to-image retrieval remained comparatively stable under these perturbations. These findings highlight the sensitivity of zero-shot performance to semantic context and input quality. Our study is limited to small-scale data and a single backbone; future work will extend this analysis to larger datasets and alternative vision–language models.
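
The embedding-similarity procedure described in the abstract can be sketched in a few lines. The example below assumes that image and prompt embeddings have already been produced by the encoders. It uses small made-up vectors in place of real CLIP ViT-B/32 features, and the function names are hypothetical. It is a minimal illustration of cosine-similarity classification and text-to-image retrieval, not the study's implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def classify_zero_shot(image_emb, prompt_embs, labels):
    """Return the label whose prompt embedding is most similar to the image."""
    scores = [cosine_similarity(image_emb, p) for p in prompt_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores


def retrieve_images(prompt_emb, image_embs, top_k=1):
    """Text-to-image retrieval: indices of the top_k most similar images."""
    scores = [cosine_similarity(prompt_emb, img) for img in image_embs]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:top_k]


# Toy 3-d embeddings standing in for encoder outputs (not real CLIP features).
labels = ["cat", "dog", "wolf"]
prompt_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.3], [0.0, 0.4, 1.0]]
image_embs = [[0.9, 0.2, 0.1], [0.0, 0.5, 0.9]]

predicted, scores = classify_zero_shot(image_embs[0], prompt_embs, labels)
top_for_wolf = retrieve_images(prompt_embs[2], image_embs, top_k=1)
```

Classification ranks prompts for a fixed image, while retrieval ranks images for a fixed prompt, so the two tasks can respond differently to the same perturbation even though they share one similarity function.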
