Zero-shot anomaly segmentation (ZSAS) has advanced significantly with the emergence of vision–language models such as CLIP. Among recent ZSAS approaches, VCP-CLIP introduced visual context prompting (VCP) and demonstrated impressive zero-shot localization capability without class-specific training. However, on revisiting VCP-CLIP, we find room for improvement in the framework. In this study, we upgrade VCP-CLIP with simple yet effective modifications designed to enhance pixel-level localization and image-level reliability. Specifically, we propose: (1) a fixed temperature scaling scheme that improves consistency in similarity estimation and stability in training; (2) a learnable anomaly map fusion scheme that adaptively aggregates anomaly cues from complementary branches; (3) an adaptive loss weighting mechanism that balances the segmentation objectives; and (4) an image-conditioned direct prompting module that injects visual context information directly into the text prompts. With minimal architectural changes, our upgraded model, dubbed VCP-CLIP+, improves substantially on VCP-CLIP on the ZSAS benchmark datasets and outperforms other state-of-the-art CLIP-based ZSAS methods in both pixel-level and image-level anomaly detection.
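The abstract does not give formulas for these components, so the following is only an illustrative sketch of how items (1) and (2) could look in a CLIP-style pipeline. Everything specific in it is an assumption rather than the authors' implementation: the temperature value `0.07`, the function names `anomaly_scores` and `fuse_maps`, the two-way normal/abnormal softmax, and the use of softmax-normalized fusion weights. In a real model the fusion logits would be trained parameters; here they are passed in as plain numbers.

```python
import math

# Assumed fixed temperature; the value used in the paper is not stated in the abstract.
TEMPERATURE = 0.07


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anomaly_scores(patch_feats, normal_text, abnormal_text, temperature=TEMPERATURE):
    """Per-patch anomaly probability from temperature-scaled cosine similarities.

    The temperature is a fixed constant here rather than a learned scalar.
    """
    scores = []
    for patch in patch_feats:
        logits = [
            cosine(patch, normal_text) / temperature,
            cosine(patch, abnormal_text) / temperature,
        ]
        # Index 1 is the probability assigned to the "abnormal" prompt.
        scores.append(softmax(logits)[1])
    return scores


def fuse_maps(map_a, map_b, fusion_logits):
    """Convex combination of two anomaly maps.

    fusion_logits stands in for two learnable parameters (hypothetical);
    softmax keeps the resulting weights positive and summing to one.
    """
    w_a, w_b = softmax(fusion_logits)
    return [w_a * a + w_b * b for a, b in zip(map_a, map_b)]
```

With equal fusion logits the two branches are averaged; training would shift the logits toward whichever branch localizes anomalies more reliably.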
Junhyeok Im
Hanhoon Park
Electronics
Pukyong National University
Im et al. (Tue,) studied this question.
www.synapsesocial.com/papers/6a0567e9a550a87e60a20200 — DOI: https://doi.org/10.3390/electronics15102058