Recently, the progress of foundation models such as CLIP and SAM has shown the great potential of zero-shot anomaly detection tasks. However, existing methods usually rely on general descriptions such as “abnormal”, and the semantic coverage is insufficient, making it difficult to express fine-grained anomaly semantics. In addition, CLIP primarily performs global-level alignment, and it is difficult to accurately locate minor defects, while the segmentation quality of SAM is highly dependent on prompt constraints. In order to solve these problems, we proposed DCS, a unified framework that integrates Grounding DINO, CLIP and SAM through three key innovations. First of all, we introduced FinePrompt for adaptive learning, which significantly enhanced the modeling ability of exception semantics by building a fine-grained exception description library and adopting learnable text embeddings. Secondly, we have designed an Adaptive Dual-path Cross-modal Interaction (ADCI) module to achieve more effective cross-modal information exchange through dual-path fusion. Finally, we proposed a Box-Point Prompt Combiner (BPPC), which combines box prior information provided by DINO with the point prompt generated by CLIP, so as to guide SAM to generate finer and more complete segmentation results. A large number of experiments have proved the effectiveness of our method. On the MVTec-AD and VisA datasets, DCS has achieved the most state-of-the-art zero-shot anomaly detection results.
Wan et al. (Thu,) studied this question.