What question did this study set out to answer?

This research aims to improve zero-shot anomaly detection by refining prompts and enhancing visual representation through innovative integration techniques.

March 13, 2026 · Open Access

S2SWCLIP: semantic-optimized prompts with spatial-wavelet synergy for zero-shot anomaly detection

What does this research mean for the field?

S2SWCLIP improves zero-shot anomaly detection by integrating semantic-optimized prompts with wavelet-spatial synergy; the authors report that it outperforms numerous existing methods on 14 real-world datasets. Novelty: novel finding. Consensus alignment: challenges consensus.

Key Points

Development of S2SWCLIP for zero-shot anomaly detection.
Combination of object-agnostic, contrastive normal-anomaly, and anomaly-referenced prompts to sharpen semantic boundaries.
Application of spatial-to-wavelet transformation for feature representation.
Use of entropy-gain similarity to optimize image-text alignment.
Evaluation on 14 real-world datasets to compare performance.
S2SWCLIP is reported to outperform numerous existing anomaly detection methods on these datasets.
Alignment between text and visual embeddings improves.
The method better captures local details and global structures in images.
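
To make the spatial-to-wavelet idea in the key points concrete, here is a minimal sketch of a single-level 2D Haar transform applied to a small grid of values. The choice of the Haar wavelet, the function name `haar_dwt2`, and the sub-band names are assumptions made for illustration; the summary does not state which wavelet family or decomposition depth S2SWCLIP uses, so this should not be read as the authors' module.

```python
def haar_dwt2(x):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Illustrative only: returns four half-resolution sub-bands
    (approximation, column-difference, row-difference, diagonal-difference).
    Sub-band naming conventions differ between libraries.
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")

    approx, col_diff, row_diff, diag_diff = [], [], [], []
    for i in range(0, h, 2):
        r_a, r_c, r_r, r_d = [], [], [], []
        for j in range(0, w, 2):
            # The four values of one non-overlapping 2x2 block.
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            # Orthonormal 2D Haar: each output is a signed sum scaled by 1/2.
            r_a.append((a + b + c + d) / 2)  # low-pass: coarse structure
            r_c.append((a - b + c - d) / 2)  # change across columns
            r_r.append((a + b - c - d) / 2)  # change across rows
            r_d.append((a - b - c + d) / 2)  # diagonal change
        approx.append(r_a)
        col_diff.append(r_c)
        row_diff.append(r_r)
        diag_diff.append(r_d)
    return approx, col_diff, row_diff, diag_diff
```

For the 2x2 input `[[1, 2], [3, 4]]`, the function returns `[[5.0]]`, `[[-1.0]]`, `[[-2.0]]`, and `[[0.0]]`. The approximation band carries the coarse (global) content while the three difference bands carry local detail, which is the general motivation for pairing frequency-domain features with spatial ones.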

Abstract

Zero-shot anomaly detection is crucial for privacy-sensitive scenarios with limited target data. However, prominent methods based on visual-language models suffer from semantic overlap due to simplistic generic prompts, while the reductive design of visual representations fails to capture crucial local details and global structures, leading to alignment deviation between text and visual embeddings. In this paper, we propose S2SWCLIP, which integrates semantic-optimized prompts with wavelet-spatial synergy to advance the design principles by refining prompt learning, enriching visual representations, and optimizing cross-modal alignment. Initially, object-agnostic prompts, contrastive normal-anomaly prompts, and anomaly-referenced prompts are combined to delineate sharper semantic boundaries via strongly contrasting vocabulary, while comprehensive semantic information is optimized through embedding integration enabled by a cross-informative adaptive fusion mechanism. Subsequently, the spatial-to-wavelet transformation module facilitates the conversion of spatial features into frequency domain representations, in synergy with hierarchically fused visual features to retain fine-grained and meaningful image details. Furthermore, the entropy-gain similarity adaptively quantifies information richness to emphasize features with low entropy disparity, optimizing image-text alignment. Large-scale experiments on 14 real-world anomaly detection datasets reveal that S2SWCLIP outperforms numerous methods. The code is available at https://github.com/Huanzh111/S2SW.
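
The abstract describes the entropy-gain similarity only in words: it emphasizes features with low entropy disparity when aligning image and text. The sketch below shows one hypothetical way such a weighting could look, scaling cosine similarity by `exp(-|H(image) - H(text)|)`, where `H` is the Shannon entropy of the softmax-normalized embedding. The formula, the function names, and the use of softmax are all assumptions introduced for illustration; the exact definition used by S2SWCLIP should be taken from the paper or the linked repository.

```python
import math


def softmax(v):
    """Turn a real-valued vector into a probability distribution."""
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(p):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def entropy_weighted_similarity(image_emb, text_emb):
    """Hypothetical weighting: down-weight cosine similarity when the two
    embeddings differ strongly in entropy (not the paper's formula)."""
    gap = abs(shannon_entropy(softmax(image_emb))
              - shannon_entropy(softmax(text_emb)))
    return math.exp(-gap) * cosine(image_emb, text_emb)
```

With identical embeddings the entropy gap is zero, so the score reduces to plain cosine similarity. As the gap grows, the weight shrinks toward zero, so pairs with similar information content dominate the alignment score.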
