Industrial visual inspection demands high-precision anomaly detection amid scarce annotations and unseen defects. This paper introduces a zero-shot framework leveraging multimodal feature fusion and stabilized attention pooling. CLIP’s global semantic embeddings are hierarchically aligned with DINOv2’s multi-scale structural features via a Dual-Modality Attention (DMA) mechanism, enabling effective cross-modal knowledge transfer for capturing macro- and micro-anomalies. A Stabilized Attention-based Pooling (SAP) module adaptively aggregates discriminative representations using self-generated anomaly heatmaps, enhancing localization accuracy and mitigating feature dilution. Trained solely in auxiliary datasets with multi-task segmentation and contrastive losses, the approach requires no target-domain samples. Extensive evaluation across seven benchmarks (MVTec AD, VisA, BTAD, MPDD, KSDD, DAGM, DTD-Synthetic) demonstrates state-of-the-art performance, achieving 93.4% image-level AUROC, 94.3% AP, 96.9% pixel-level AUROC, and 92.4% AUPRO on average. Ablation studies confirm the efficacy of DMA and SAP, while qualitative results highlight superior boundary precision and noise suppression. The framework offers a scalable, annotation-efficient solution for real-world industrial anomaly detection.
Building similarity graph...
Analyzing shared references across papers
Loading...
He et al. (Fri,) studied this question.
www.synapsesocial.com/papers/6940224e2d562116f28fc1e2 — DOI: https://doi.org/10.3390/electronics14244785
Zongxiang He
Khalil AL-Bukhaiti
Wang Kai-yang
Electronics
Zhejiang University
Zhejiang University of Technology
Huzhou University
Building similarity graph...
Analyzing shared references across papers
Loading...