Video anomaly detection is crucial for intelligent surveillance, yet existing vision-language models such as CLIP still struggle with surveillance videos due to fixed viewpoints, illumination variations, weak object semantics, and insufficient temporal reasoning. To address these limitations, we propose ABD-CLIP, a weakly supervised video anomaly detection framework that unifies anomaly-aware visual adaptation, bidirectional prompt learning, and temporal relational modeling. Specifically, a Dynamic Prompt Adapter (DPA) performs lightweight anomaly-aware adaptation of frozen CLIP visual features and provides visual-context-guided prompt refinement, thereby improving context-sensitive vision–language alignment. In addition, a bidirectional prompt learning mechanism decomposes each prompt into prefix, category, and suffix components, where the prefix captures scene-oriented contextual priors and the suffix refines anomaly-related attributes. Furthermore, a Temporal Graph-Former (TGF) combines block-wise local self-attention with dual-graph temporal reasoning to jointly model short-term motion dynamics, long-range semantic relations, and temporal continuity. Experiments show that ABD-CLIP achieves 84.41% AUC on UCF-Crime and 78.52% AP on XD-Violence under the coarse-grained weakly supervised setting, while also improving fine-grained anomaly recognition. Additional analyses of stability, cross-dataset transfer, runtime, and qualitative representation structure further support the effectiveness of the proposed framework.
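As a rough illustration of the prefix/category/suffix decomposition described above, the following minimal sketch assembles a prompt by concatenating three groups of token embeddings. It is not the authors' implementation: the function name `build_prompt`, the embedding width, the context lengths, the class names, and the use of random arrays in place of learned or CLIP-derived embeddings are all assumptions made for illustration.

```python
import numpy as np


def build_prompt(prefix, category, suffix):
    """Concatenate [prefix | category | suffix] embeddings along the token axis."""
    return np.concatenate([prefix, category, suffix], axis=0)


rng = np.random.default_rng(0)
dim = 8                      # hypothetical embedding width
n_prefix, n_suffix = 4, 3    # hypothetical context lengths

# Assumption: prefix and suffix context vectors are shared across classes.
# In a real model these would be learnable parameters, not random draws.
prefix = rng.normal(size=(n_prefix, dim))   # scene-oriented context
suffix = rng.normal(size=(n_suffix, dim))   # anomaly-attribute context

# Stand-ins for category token embeddings (one or more tokens per class name).
categories = {
    "normal": rng.normal(size=(1, dim)),
    "fighting": rng.normal(size=(2, dim)),
}

prompts = {
    name: build_prompt(prefix, emb, suffix)
    for name, emb in categories.items()
}

for name, tokens in prompts.items():
    print(name, tokens.shape)
```

Each resulting prompt keeps the category tokens in the middle, so context on either side can be adapted independently of the class name. How the abstract's method actually parameterizes and refines these components is not specified here.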
Kang et al. (Sun,) studied this question.