Zero-shot vision–language models such as CLIP have demonstrated strong generalization without task-specific training, yet their robustness under semantic ambiguity and distributional perturbations remains insufficiently understood. In this work, we systematically study the behavior of CLIP in a controlled zero-shot image classification and retrieval setting. Using embedding similarity between images and text prompts, we evaluate CLIP ViT-B/32 on a five-class dataset through a series of targeted experiments, including prompt template ablations, hard semantic distinctions (e.g., cat vs. dog vs. wolf), open-set classification with semantically related distractors, and robustness tests under image degradation. Although CLIP achieves near-perfect accuracy with standard prompts and clean inputs, we observe a marked decline in performance in open-set scenarios and under image degradation, revealing non-trivial failure patterns. In contrast, text-to-image retrieval remains comparatively stable under perturbation. These findings highlight the sensitivity of zero-shot performance to semantic context and input quality. Our study is limited to small-scale data and a single backbone; future work will extend this analysis to larger datasets and alternative vision–language models.
Zheng Xiangquan (Wed,) studied this question.
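
The embedding-similarity scoring mentioned in the abstract can be illustrated with a minimal sketch. It is not the study's code: the three-dimensional vectors are hand-made toy stand-ins for CLIP embeddings, the function names are hypothetical, the temperature of 0.01 is an assumed value, and the `reject_below` threshold is only one common way to handle open-set inputs, not necessarily the one used here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, temperature=0.01):
    """Softmax over similarity scores; the temperature is an assumed value."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, prompt_embs, labels, reject_below=None):
    """Pick the label whose prompt embedding is most similar to the image.

    If reject_below is set and the best similarity falls under it, return
    "unknown" (a hypothetical open-set rule, used here for illustration).
    """
    sims = [cosine_similarity(image_emb, p) for p in prompt_embs]
    probs = softmax(sims)
    best = max(range(len(labels)), key=lambda i: sims[i])
    if reject_below is not None and sims[best] < reject_below:
        return "unknown", sims, probs
    return labels[best], sims, probs


# Toy prompt embeddings: "dog" and "wolf" are deliberately close together,
# mimicking a hard semantic distinction.
labels = ["cat", "dog", "wolf"]
prompt_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.6, 0.8]]

# Toy image embedding lying nearest to the "dog" prompt.
image_emb = [0.1, 0.9, 0.3]
label, sims, probs = zero_shot_classify(image_emb, prompt_embs, labels)
print(label, [round(s, 3) for s in sims])

# Toy out-of-set image: no prompt is a good match, so it is rejected.
distractor_emb = [0.6, 0.5, -0.6]
open_set_label, _, _ = zero_shot_classify(
    distractor_emb, prompt_embs, labels, reject_below=0.7
)
print(open_set_label)
```

In a real pipeline the toy vectors would be replaced by the outputs of the model's image and text encoders. The closeness of the "dog" and "wolf" scores in this toy case shows why semantically related classes and distractors are the harder setting.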