What question did this study set out to answer?

The study aims to improve video anomaly detection in surveillance footage with a framework that combines anomaly-aware visual adaptation, bidirectional prompt learning, and temporal relational modeling.

June 17, 2026 · Open Access

ABD-CLIP: anomaly-aware bidirectional CLIP with temporal graph-former for weakly supervised video anomaly detection

Key Points

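Proposed the ABD-CLIP framework, in which a Dynamic Prompt Adapter (DPA) performs lightweight anomaly-aware adaptation of frozen CLIP visual features.
Implemented bidirectional prompt learning, splitting each prompt into prefix, category, and suffix components, to improve contextual understanding of anomalies.
Used a temporal graph-former (TGF) that combines block-wise local self-attention with dual-graph temporal reasoning over video sequences.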
Achieved 84.41% AUC on UCF-Crime and 78.52% AP on XD-Violence under weakly supervised conditions.
Demonstrated improvements in fine-grained anomaly recognition.
Stability, cross-dataset transfer, and runtime analyses support the framework's effectiveness.

Abstract

Video anomaly detection is crucial for intelligent surveillance, yet existing vision-language models such as CLIP still struggle with surveillance videos due to fixed viewpoints, illumination variations, weak object semantics, and insufficient temporal reasoning. To address these limitations, we propose ABD-CLIP, a weakly supervised video anomaly detection framework that integrates anomaly-aware visual adaptation, bidirectional prompt learning, and temporal relational modeling in a unified manner. Specifically, a Dynamic Prompt Adapter (DPA) performs lightweight anomaly-aware adaptation of frozen CLIP visual features and further provides visual-context-guided prompt refinement, thereby improving context-sensitive vision–language alignment. In addition, a bidirectional prompt learning mechanism decomposes each prompt into prefix, category, and suffix components, where the prefix captures scene-oriented contextual priors and the suffix refines anomaly-related attributes. Furthermore, a temporal graph-former (TGF) combines block-wise local self-attention with dual-graph temporal reasoning to jointly model short-term motion dynamics, long-range semantic relations, and temporal continuity. Experiments on UCF-Crime and XD-Violence demonstrate that ABD-CLIP achieves 84.41% AUC on UCF-Crime and 78.52% AP on XD-Violence under the coarse-grained weakly supervised setting, while also improving fine-grained anomaly recognition. Additional analyses on stability, cross-dataset transfer, runtime, and qualitative representation structure further verify the effectiveness of the proposed framework.
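
The block-wise local self-attention named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the TGF block size, learned projections, or graph construction, so the function name `blockwise_self_attention`, the `block_size` parameter, and the use of raw features as queries, keys, and values are all assumptions made for illustration; this is not the authors' implementation.

```python
import math


def blockwise_self_attention(frames, block_size):
    """Self-attention restricted to non-overlapping temporal blocks.

    frames: list of T feature vectors (lists of floats), one per snippet.
    Queries, keys, and values are the raw features (no learned
    projections), a simplification assumed for this sketch.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        for query in block:
            # Scaled dot-product scores against frames in the same block only.
            scores = [
                scale * sum(q * k for q, k in zip(query, key))
                for key in block
            ]
            # Numerically stable softmax over the block.
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            weights = [w / total for w in weights]
            # Attention-weighted average of the block's features.
            output.append([
                sum(w * value[d] for w, value in zip(weights, block))
                for d in range(dim)
            ])
    return output


# Hypothetical toy input: four 2-D snippet features, two blocks of two.
frames = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 5.0]]
mixed = blockwise_self_attention(frames, block_size=2)
```

Because attention never crosses a block boundary, the first two outputs mix only the first two snippets, and the last two stay at `[5.0, 5.0]`. Restricting attention this way keeps the cost linear in the number of blocks, which is presumably why the abstract pairs it with graph-based reasoning for long-range relations.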
