December 5, 2025 · Open Access

Zero-Shot Industrial Anomaly Detection via CLIP-DINOv2 Multimodal Fusion and Stabilized Attention Pooling

Key Points

The framework achieves 93.4% image-level AUROC and 94.3% AP, averaged over the evaluated benchmarks.
Evaluation spans seven industrial benchmarks: MVTec AD, VisA, BTAD, MPDD, KSDD, DAGM, and DTD-Synthetic.
A Dual-Modality Attention (DMA) mechanism aligns CLIP's global semantic embeddings with DINOv2's multi-scale structural features, enabling cross-modal knowledge transfer.
A Stabilized Attention-based Pooling (SAP) module improves localization accuracy and mitigates feature dilution during visual inspection.
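
For readers unfamiliar with the image-level AUROC figure quoted above, the following is a minimal sketch of how such a score can be computed from per-image anomaly scores. It uses the standard pairwise-ranking definition of AUROC and is not taken from the paper; the function name `auroc` and the toy labels and scores are illustrative assumptions.

```python
import numpy as np

def auroc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 1 for anomalous images, 0 for normal images.
    scores: higher means "more anomalous".
    Hypothetical helper for illustration, not the authors' evaluation code.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one anomalous and one normal sample")
    # Compare every anomalous score with every normal score.
    diff = pos[:, None] - neg[None, :]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Toy data (made up): two normal and two anomalous images.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise form is quadratic in the number of images, which is fine for a sketch; evaluation code for full benchmarks would typically use a sorted, rank-based computation instead.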

Abstract

Industrial visual inspection demands high-precision anomaly detection amid scarce annotations and unseen defects. This paper introduces a zero-shot framework leveraging multimodal feature fusion and stabilized attention pooling. CLIP’s global semantic embeddings are hierarchically aligned with DINOv2’s multi-scale structural features via a Dual-Modality Attention (DMA) mechanism, enabling effective cross-modal knowledge transfer for capturing macro- and micro-anomalies. A Stabilized Attention-based Pooling (SAP) module adaptively aggregates discriminative representations using self-generated anomaly heatmaps, enhancing localization accuracy and mitigating feature dilution. Trained solely on auxiliary datasets with multi-task segmentation and contrastive losses, the approach requires no target-domain samples. Extensive evaluation across seven benchmarks (MVTec AD, VisA, BTAD, MPDD, KSDD, DAGM, DTD-Synthetic) demonstrates state-of-the-art performance, achieving 93.4% image-level AUROC, 94.3% AP, 96.9% pixel-level AUROC, and 92.4% AUPRO on average. Ablation studies confirm the efficacy of DMA and SAP, while qualitative results highlight superior boundary precision and noise suppression. The framework offers a scalable, annotation-efficient solution for real-world industrial anomaly detection.
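
The idea of aggregating patch features with a self-generated anomaly heatmap can be illustrated with a small sketch. This is not the paper's SAP module, whose exact formulation is not given in the abstract; it assumes a plain softmax over heatmap scores with a temperature, and the names `heatmap_weighted_pool` and `temperature` as well as the toy data are hypothetical.

```python
import numpy as np

def heatmap_weighted_pool(patch_features, heatmap, temperature=0.1):
    """Pool N patch features (N x D) into one D-dim vector.

    Each patch is weighted by a softmax over its anomaly score, so
    high-scoring patches dominate the pooled representation.
    Assumed formulation for illustration only.
    """
    feats = np.asarray(patch_features, dtype=np.float64)
    scores = np.asarray(heatmap, dtype=np.float64).ravel() / temperature
    scores = scores - scores.max()      # subtract max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()   # weights now sum to 1
    return weights @ feats, weights

# Toy example (made-up data): 16 patches with 8-dim features.
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))
features[5] += 5.0                      # one patch carries a "defect" signal
heatmap = np.zeros(16)
heatmap[5] = 1.0                        # the heatmap highlights that patch

pooled, weights = heatmap_weighted_pool(features, heatmap)
mean_pooled = features.mean(axis=0)     # plain average pooling for comparison

# Distance of each pooled vector from the defective patch's feature.
print(np.linalg.norm(pooled - features[5]))
print(np.linalg.norm(mean_pooled - features[5]))
```

In this toy setting, plain averaging spreads the single defective patch over all 16 patches, which is one way to picture the feature dilution mentioned in the abstract, while the heatmap-weighted vector stays close to the defective patch's feature.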

Cite this study

He et al. (2025) studied this question.

www.synapsesocial.com/papers/6940224e2d562116f28fc1e2 — DOI: https://doi.org/10.3390/electronics14244785

Authors

Zongxiang He

Khalil AL-Bukhaiti

Wang Kai-yang

Journals

Electronics

Institutions

Zhejiang University

Zhejiang University of Technology

Huzhou University
