May 14, 2026 · Open Access

VCP-CLIP+: Stabilizing and Optimizing VCP-CLIP with Minimal Architectural Changes

Key Points

The aim is to enhance the VCP-CLIP framework for better zero-shot anomaly segmentation performance.
Upgraded VCP-CLIP with fixed temperature scaling for improved similarity estimation.
Introduced a learnable anomaly map fusion scheme for optimal aggregation of anomaly cues.
Implemented an adaptive loss weighting mechanism and an image-conditioned direct prompting module.
VCP-CLIP+ outperformed VCP-CLIP in pixel-level anomaly detection and image-level reliability.
Showed significant performance improvements over state-of-the-art CLIP-based methods on benchmark datasets.

Abstract

Zero-shot anomaly segmentation (ZSAS) has advanced significantly with the emergence of vision–language models such as CLIP. Among recent ZSAS approaches, VCP-CLIP introduced visual context prompting (VCP) and demonstrated impressive zero-shot localization capability without class-specific training. However, revisiting VCP-CLIP, we find that the framework leaves room for improvement. In this study, we upgrade VCP-CLIP with simple yet effective modifications designed to enhance pixel-level localization and image-level reliability. Specifically, we propose: (1) a fixed temperature scaling scheme that improves consistency in similarity estimation and stability in training; (2) a learnable anomaly map fusion scheme that adaptively and optimally aggregates anomaly cues from complementary branches; (3) an adaptive loss weighting mechanism that balances segmentation objectives; and (4) an image-conditioned direct prompting module that injects visual context information directly into the text prompts. With minimal architectural changes, our upgraded model, dubbed VCP-CLIP+, achieves substantial performance improvements over VCP-CLIP on ZSAS benchmark datasets, outperforming other state-of-the-art CLIP-based ZSAS methods in both pixel-level and image-level anomaly detection.
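
To make the first two modifications more concrete, the following is a minimal sketch of what fixed temperature scaling and weighted anomaly map fusion could look like. It is not the authors' implementation: the abstract gives neither the temperature value nor the form of the fusion, so the constant `FIXED_TEMPERATURE`, the softmax-normalized fusion weights, and the function names `anomaly_probability` and `fuse_maps` are all assumptions made for illustration. Embeddings and anomaly maps are represented as plain Python lists.

```python
import math

# Hypothetical value: the abstract does not state the temperature used.
FIXED_TEMPERATURE = 0.07


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def anomaly_probability(patch, normal_text, abnormal_text,
                        temperature=FIXED_TEMPERATURE):
    """Probability that a patch embedding matches the 'abnormal' text embedding.

    The temperature is a fixed constant here rather than a trained parameter.
    """
    logits = [
        cosine(patch, normal_text) / temperature,
        cosine(patch, abnormal_text) / temperature,
    ]
    return softmax(logits)[1]


def fuse_maps(maps, fusion_logits):
    """Fuse per-branch anomaly maps (flat lists of pixel scores).

    Assumed form: a convex combination with softmax-normalized weights.
    During training, fusion_logits would be learnable; here they are floats.
    """
    weights = softmax(fusion_logits)
    n_pixels = len(maps[0])
    return [
        sum(w * m[i] for w, m in zip(weights, maps))
        for i in range(n_pixels)
    ]
```

In this sketch, a patch that is equally similar to both text embeddings receives a probability of 0.5 regardless of the temperature, while a smaller temperature sharpens the decision for any other patch. Normalizing the fusion weights with a softmax keeps the fused map within the range of the input maps, which is one possible design choice among several.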
