Zero-shot anomaly detection is crucial for privacy-sensitive scenarios with limited target data. However, prominent methods based on vision-language models suffer from semantic overlap caused by simplistic generic prompts, and their reductive visual representations fail to capture crucial local details and global structures, leading to misalignment between text and visual embeddings. In this paper, we propose S2SWCLIP, which integrates semantic-optimized prompts with wavelet-spatial synergy to refine prompt learning, enrich visual representations, and optimize cross-modal alignment. First, object-agnostic prompts, contrastive normal-anomaly prompts, and anomaly-referenced prompts are combined to delineate sharper semantic boundaries via strongly contrasting vocabulary, and a cross-informative adaptive fusion mechanism integrates their embeddings into a comprehensive semantic representation. Second, the spatial-to-wavelet transformation module converts spatial features into frequency-domain representations, which work in synergy with hierarchically fused visual features to retain fine-grained and meaningful image details. Third, the entropy-gain similarity adaptively quantifies information richness to emphasize features with low entropy disparity, optimizing image-text alignment. Extensive experiments on 14 real-world anomaly detection datasets show that S2SWCLIP outperforms numerous existing methods. The code is available at https://github.com/Huanzh111/S2SW.
Zhang et al. (Wed,) studied this question.
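To make the idea of a spatial-to-wavelet conversion concrete, the following is a minimal sketch of a single-level 2D Haar transform applied to one feature map. It is an illustration only: the abstract does not state which wavelet family, decomposition depth, or implementation S2SWCLIP uses, so the Haar choice and the function name `haar_dwt2` are assumptions rather than the authors' module.

```python
def haar_dwt2(feat):
    """Single-level 2D Haar transform of a feature map (illustrative sketch).

    `feat` is a list of rows with even height and width. Returns four
    half-resolution maps: one low-frequency approximation and three
    detail maps. The Haar wavelet is an assumption made for this example.
    """
    h, w = len(feat), len(feat[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for j in range(0, w, 2):
            # The 2x2 block: a b on the top row, c d on the bottom row.
            a, b = feat[i][j], feat[i][j + 1]
            c, d = feat[i + 1][j], feat[i + 1][j + 1]
            row_ll.append((a + b + c + d) / 2)  # local average (coarse structure)
            row_lh.append((a + b - c - d) / 2)  # top row minus bottom row
            row_hl.append((a - b + c - d) / 2)  # left column minus right column
            row_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


def energy(m):
    """Sum of squared entries, used to check that the transform is orthonormal."""
    return sum(v * v for row in m for v in row)


feat = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ll, lh, hl, hh = haar_dwt2(feat)
```

With the 1/2 scaling used here the transform is orthonormal, so the total energy of the four sub-bands equals that of the input. The approximation map carries coarse, global structure, while the three detail maps respond to local intensity changes, which is the kind of complementary information the abstract attributes to combining wavelet and spatial features.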