Weakly supervised video anomaly detection (WVAD) aims to localize events or behaviors that deviate from normal patterns in untrimmed videos using only video-level labels. Recent studies typically use supplementary modalities to assist anomaly detection. However, these methods suffer from two main issues. (1) Temporal modeling of long-duration anomalous events is limited: the model struggles to retain key information over time, and the resulting forgetting hinders tracking of an event's overall dynamic evolution and complicates its analysis and understanding. (2) Multi-modal fusion is insufficient, particularly when visual and audio information are temporally inconsistent, which causes the model to overlook key information and directly degrades the detection and recognition of anomalous events. To address these issues, we propose a visual-guided long-term temporal context learning network (LTCLNet). The network consists of three key components: a cross-modal interaction module, a multi-modal fusion module, and a visual-guided parameter optimization strategy. First, to address forgetting in long-duration anomaly detection, we design a cross-modal interaction module built around a cross-matrix mechanism. This mechanism provides bidirectional temporal guidance across modalities, allowing the temporal modeling of each modality to dynamically integrate information from the other. By sharing temporal information between the visual and audio modalities, the model can continuously track the dynamic evolution of an event. Second, to fully exploit the complementary characteristics of the modalities, we introduce a novel temporal reversal integration method in the multi-modal fusion module.
This method reverses the feature sequences of each modality to enhance the model's perception of temporal dynamics. Fusing the modality features before and after reversal strengthens the temporal structure shared between modalities and improves the model's ability to capture anomalous information. Additionally, our visual-guided parameter optimization strategy trains a parallel visual-modality network as a semantic anchor, keeping the model aligned with a semantically stable and structurally clear visual flow during learning and thereby promoting stable, semantically coherent training. Extensive experiments on datasets such as XD-Violence demonstrate that our method significantly outperforms existing approaches, with notable improvements in the accuracy and stability of long-term anomaly detection. Our code is publicly available at https://github.com/ibliever/LTCLNet.
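As a rough illustration of the temporal reversal idea described above, the following minimal sketch fuses two time-aligned feature sequences using a forward pass and a time-reversed pass. It is not the LTCLNet implementation: the function names `causal_mean` and `reversal_fuse` are hypothetical, a simple running mean stands in for whatever learned temporal model the network uses, and concatenation followed by averaging is an assumed fusion rule chosen only for clarity.

```python
from typing import List

# T time steps, each a D-dimensional feature vector (illustrative type alias).
Sequence = List[List[float]]


def causal_mean(seq: Sequence) -> Sequence:
    """Running mean over time: step t sees only steps 0..t.

    Stand-in (assumption) for a learned causal temporal model.
    """
    out, running = [], [0.0] * len(seq[0])
    for t, frame in enumerate(seq, start=1):
        running = [r + x for r, x in zip(running, frame)]
        out.append([r / t for r in running])
    return out


def reversal_fuse(visual: Sequence, audio: Sequence) -> Sequence:
    """Fuse two time-aligned modalities using forward and time-reversed passes."""
    if len(visual) != len(audio):
        raise ValueError("visual and audio must have the same number of time steps")
    # Concatenate the per-step features of both modalities (assumed fusion input).
    joint = [v + a for v, a in zip(visual, audio)]
    # Forward pass: each step summarizes the past.
    forward = causal_mean(joint)
    # Reversed pass: run on the flipped sequence, then flip back so that
    # each step summarizes the future and stays aligned with `forward`.
    backward = causal_mean(joint[::-1])[::-1]
    # Average the two views so every step carries past and future context.
    return [
        [(f + b) / 2 for f, b in zip(fw, bw)]
        for fw, bw in zip(forward, backward)
    ]


if __name__ == "__main__":
    visual = [[1.0], [3.0]]  # two time steps, one visual feature each
    audio = [[0.0], [2.0]]   # two time steps, one audio feature each
    print(reversal_fuse(visual, audio))
```

Flipping the reversed pass back before fusion keeps the two views aligned step by step, so each fused feature combines context from both temporal directions at the same time index.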
Zhou et al. studied this question.