In the realm of public safety, the automated identification of potential threats from voluminous surveillance streams is pivotal for developing intelligent security systems. Manual monitoring of such massive video feeds is highly inefficient, prone to human fatigue, and often leads to missed detections or false alarms. Leveraging deep learning for automatic anomaly detection is therefore essential to improve response efficiency and mitigate security risks. Weakly supervised video anomaly detection (WS-VAD) has emerged as a critical yet challenging task in this domain. In this study, we propose the Temporal-Enhanced and Visual-Text Adaptive Fusion (TE-VTAF) model for robust WS-VAD. Specifically, a Dynamic Local–Global Temporal Adaptive Module (DLG-TAM) is designed to capture multi-scale temporal dependencies and extract high-level video semantics. Concurrently, a Visual-Text Adaptive Fusion Module (VTAFM) is introduced to aggregate complementary cross-modal features, utilizing a competitive activation mechanism to suppress redundant information and enhance the discriminative power between normal and anomalous events. To further refine the learning process within the Multiple Instance Learning (MIL) framework, we incorporate a Top-K outer bag loss and a K-maxmin inner bag loss. These constraints effectively maximize the inter-class separability while suppressing label noise from normal instances within positive bags, thereby bolstering the detector’s robustness. Extensive experiments demonstrate that the proposed TE-VTAF consistently outperforms state-of-the-art methods on two large-scale benchmarks, achieving an AUC of 88.93% on UCF-Crime and an AP of 85.62% on XD-Violence.
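The two MIL constraints named above can be illustrated with a minimal sketch. The abstract does not define either loss, so the forms below are assumptions chosen only to convey the idea: the outer bag loss is taken to be binary cross-entropy on the mean of a video's top-K snippet scores, and the inner bag loss is taken to be a hinge on the gap between the top-K and bottom-K mean scores inside an anomalous video. All function names, the `margin` parameter, and the example scores are hypothetical and may differ from the TE-VTAF implementation.

```python
import math

def top_k_mean(scores, k):
    """Mean of the k largest snippet scores in a bag (one video)."""
    k = max(1, min(k, len(scores)))
    return sum(sorted(scores, reverse=True)[:k]) / k

def bottom_k_mean(scores, k):
    """Mean of the k smallest snippet scores in a bag."""
    k = max(1, min(k, len(scores)))
    return sum(sorted(scores)[:k]) / k

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def outer_bag_loss(bags, labels, k):
    """ASSUMED form: BCE between each bag's top-k mean score and its
    video-level label, averaged over bags. This pushes anomalous and
    normal videos apart using only video-level supervision."""
    losses = [bce(top_k_mean(s, k), y) for s, y in zip(bags, labels)]
    return sum(losses) / len(losses)

def inner_bag_loss(pos_bag, k, margin=1.0):
    """ASSUMED form: hinge on the gap between the top-k and bottom-k mean
    scores within one anomalous bag, so that likely-normal snippets in a
    positive bag are scored low rather than inheriting the video label."""
    gap = top_k_mean(pos_bag, k) - bottom_k_mean(pos_bag, k)
    return max(0.0, margin - gap)

# Hypothetical snippet-level anomaly scores in [0, 1] for two videos.
abnormal_video = [0.1, 0.2, 0.9, 0.8]
normal_video = [0.1, 0.2, 0.1, 0.3]

total = (outer_bag_loss([abnormal_video, normal_video], [1, 0], k=2)
         + inner_bag_loss(abnormal_video, k=2))
print(f"combined loss: {total:.4f}")
```

In this sketch the outer term acts across videos while the inner term acts within a single anomalous video, which mirrors the division of labor the abstract describes between inter-class separability and label-noise suppression.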
Si et al. studied this question.