What question did this study set out to answer?

This study investigates a new model for improving the detection of video anomalies in surveillance systems.

June 9, 2026 · Open Access

Temporal-Enhanced and Visual-Text Adaptive Fusion for Weakly Supervised Video Anomaly Detection in Public Safety

Key Points

Proposed the Temporal-Enhanced and Visual-Text Adaptive Fusion (TE-VTAF) model.
Developed a Dynamic Local–Global Temporal Adaptive Module (DLG-TAM) to capture temporal dependencies.
Implemented a Visual-Text Adaptive Fusion Module (VTAFM) for integrating cross-modal features.
Achieved an AUC of 88.93% on UCF-Crime and an AP of 85.62% on XD-Violence.
Outperformed state-of-the-art methods on both benchmarks (UCF-Crime and XD-Violence).

Abstract

In the realm of public safety, the automated identification of potential threats from voluminous surveillance streams is pivotal for developing intelligent security systems. Manual monitoring of such massive video feeds is highly inefficient, prone to human fatigue, and often leads to missed detections or false alarms. Leveraging deep learning for automatic anomaly detection is therefore essential to improve response efficiency and mitigate security risks. Weakly supervised video anomaly detection (WS-VAD) has emerged as a critical yet challenging task in this domain. In this study, we propose the Temporal-Enhanced and Visual-Text Adaptive Fusion (TE-VTAF) model for robust WS-VAD. Specifically, a Dynamic Local–Global Temporal Adaptive Module (DLG-TAM) is designed to capture multi-scale temporal dependencies and extract high-level video semantics. Concurrently, a Visual-Text Adaptive Fusion Module (VTAFM) is introduced to aggregate complementary cross-modal features, utilizing a competitive activation mechanism to suppress redundant information and enhance the discriminative power between normal and anomalous events. To further refine the learning process within the Multiple Instance Learning (MIL) framework, we incorporate a Top-K outer bag loss and a K-maxmin inner bag loss. These constraints effectively maximize the inter-class separability while suppressing label noise from normal instances within positive bags, thereby bolstering the detector’s robustness. Extensive experiments demonstrate that the proposed TE-VTAF consistently outperforms state-of-the-art methods on two large-scale benchmarks, achieving an AUC of 88.93% on UCF-Crime and an AP of 85.62% on XD-Violence.
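The abstract names a Top-K outer bag loss and a K-maxmin inner bag loss but does not give their formulas. The following is a minimal sketch of how such MIL losses are commonly built, not the authors' implementation. It assumes that each video (bag) is a list of per-snippet anomaly scores in [0, 1], that the outer loss is binary cross-entropy on the mean of the k highest scores, and that the inner loss is a hinge on the gap between the k highest and k lowest scores within anomalous bags. All function names, the value of `k`, and the `margin` are hypothetical.

```python
import math


def topk_mean(scores, k):
    """Mean of the k largest snippet scores in one bag (video)."""
    k = max(1, min(k, len(scores)))
    return sum(sorted(scores, reverse=True)[:k]) / k


def bottomk_mean(scores, k):
    """Mean of the k smallest snippet scores in one bag (video)."""
    k = max(1, min(k, len(scores)))
    return sum(sorted(scores)[:k]) / k


def topk_outer_bag_loss(bags, labels, k=3, eps=1e-7):
    """Assumed form: binary cross-entropy between the top-k pooled
    video score and the video-level label (1 = anomalous, 0 = normal)."""
    total = 0.0
    for scores, y in zip(bags, labels):
        # Clamp the pooled score so log() never receives 0.
        p = min(max(topk_mean(scores, k), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(bags)


def kmaxmin_inner_bag_loss(bags, labels, k=3, margin=1.0):
    """Assumed form: hinge penalty when, inside an anomalous bag, the
    top-k mean does not exceed the bottom-k mean by at least `margin`."""
    gaps = [
        topk_mean(scores, k) - bottomk_mean(scores, k)
        for scores, y in zip(bags, labels)
        if y == 1
    ]
    if not gaps:
        return 0.0  # no anomalous bags in this batch
    return sum(max(0.0, margin - g) for g in gaps) / len(gaps)


# Toy batch: one anomalous video and one normal video, six snippets each.
bags = [
    [0.10, 0.90, 0.80, 0.20, 0.95, 0.05],
    [0.10, 0.20, 0.05, 0.10, 0.15, 0.10],
]
labels = [1, 0]
outer = topk_outer_bag_loss(bags, labels, k=3)
inner = kmaxmin_inner_bag_loss(bags, labels, k=3, margin=1.0)
```

Under these assumptions, the outer term pushes the pooled scores of anomalous and normal videos apart, while the inner term separates likely anomalous snippets from the normal snippets that share a positive bag, which is the label-noise suppression the abstract describes. How the paper actually weights and combines the two terms is not stated here.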
