Automatic detection of anomalous and violent behavior in video has become a key task in intelligent surveillance systems, where it serves to enhance public safety and protect lives and property. Violent actions are often abrupt, rapid, and irregular, which makes them difficult for conventional approaches to capture. Existing methods based on hand-crafted features and convolutional neural networks remain limited in spatiotemporal feature extraction, recognition accuracy, and robustness. To address these issues, this paper proposes HSTNet, a hybrid neural architecture that integrates Spiking Neural Networks (SNNs) with Transformers. The framework adopts a dual-branch design: the SNN branch models the temporal dynamics of the video, while the Transformer branch extracts spatial structural information. A feature interaction module is further introduced to enable deep cross-modal fusion of the two branches. Experiments on multiple datasets, including UCF101, HMDB51, Hockey Fight, and Movies Fight, show that HSTNet achieves higher accuracy than the state-of-the-art methods it is compared against, indicating strong performance and promising application potential.
Meng et al. (Thu,) studied this question.
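To illustrate why a spiking branch suits the abrupt temporal dynamics described above, the following is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron driven by a per-frame motion signal. It is not HSTNet's implementation: the abstract does not specify the neuron model, so the choice of LIF, the function names `lif_spikes` and `firing_rate`, and the parameters `tau` and `v_threshold` are all assumptions made for illustration.

```python
from typing import List


def lif_spikes(inputs: List[float], tau: float = 2.0,
               v_threshold: float = 1.0) -> List[int]:
    """Simulate one leaky integrate-and-fire neuron (assumed neuron model).

    `inputs` holds one scalar per video frame, e.g. a hypothetical
    frame-difference magnitude. Returns one binary spike per frame.
    """
    v = 0.0  # membrane potential
    spikes = []
    for x in inputs:
        # Leaky integration: the potential relaxes toward the current input.
        v = v + (x - v) / tau
        if v >= v_threshold:
            spikes.append(1)
            v = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


def firing_rate(spikes: List[int]) -> float:
    """Fraction of time steps with a spike: a simple temporal feature."""
    return sum(spikes) / len(spikes) if spikes else 0.0


# A mostly static clip with one abrupt burst of motion at frame 2.
motion = [0.1, 0.1, 3.0, 0.1]
print(lif_spikes(motion))
print(firing_rate(lif_spikes(motion)))
```

In this sketch the neuron stays silent during slow, low-magnitude input and fires only at the abrupt burst, which is the kind of sparse, event-driven response that motivates using an SNN for the temporal branch. How the resulting temporal features are fused with the Transformer branch is left to the feature interaction module, which the abstract does not detail.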