Real-world surveillance systems require weakly supervised video anomaly detection because long untrimmed videos rarely come with accurate temporal labels. A model must classify each video as normal or abnormal and also localize sparse anomalous segments using only video-level supervision. In this paper, we introduce ST-CapsNet, a spatio-temporal capsule network that improves weakly supervised anomaly localization through structured representations and temporal agreement. Each video is divided into 32 segments, and each segment is encoded as a 512-dimensional 3D CNN (Convolutional Neural Network) feature. Primary capsules encode segment-level patterns as vectors, and temporal capsules are formed by dynamic routing over time, which lets related abnormal segments support a common event representation. Training follows a multiple-instance learning formulation that combines a bag-level BCE (Binary Cross-Entropy) loss, a ranking loss that separates abnormal from normal videos, and smoothness and sparsity regularization that encourage temporal consistency and sparse event behavior. Experiments on the weakly supervised FAST (Focused and Accelerated Subset Training) split of UCF-Crime demonstrate that ST-CapsNet outperforms strong baselines. The results indicate that capsule routing is an effective component of temporal reasoning for weakly supervised surveillance anomaly detection.
Almurumudhe et al. (Sat,) studied this question.
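To make the training objective concrete, the following is a minimal sketch of how a bag-level BCE term, a ranking term, and smoothness and sparsity regularizers could be combined for one abnormal and one normal video. It is an illustration under stated assumptions, not the paper's implementation: the function name `mil_loss`, the max pooling used to obtain a bag-level score, the hinge form of the ranking term, and the values of `margin`, `lam_smooth`, and `lam_sparse` are all hypothetical choices that the abstract does not specify.

```python
import math


def mil_loss(abn_scores, nor_scores, margin=1.0,
             lam_smooth=8e-5, lam_sparse=8e-5):
    """Sketch of a multiple-instance objective over segment scores.

    abn_scores, nor_scores: per-segment anomaly scores in [0, 1] for one
    abnormal and one normal video (32 segments each in the abstract).
    margin, lam_smooth, lam_sparse: assumed hyperparameters.
    """
    eps = 1e-7  # avoids log(0)

    # Bag-level score: max over segments (assumed pooling choice).
    bag_abn = max(abn_scores)
    bag_nor = max(nor_scores)

    # Bag-level BCE with video labels: abnormal = 1, normal = 0.
    bce = -(math.log(bag_abn + eps) + math.log(1.0 - bag_nor + eps)) / 2.0

    # Hinge ranking term: the abnormal bag should outscore the normal bag.
    rank = max(0.0, margin - bag_abn + bag_nor)

    # Smoothness: penalize score jumps between adjacent segments.
    smooth = sum((b - a) ** 2 for a, b in zip(abn_scores, abn_scores[1:]))

    # Sparsity: anomalies should occupy few segments.
    sparse = sum(abn_scores)

    total = bce + rank + lam_smooth * smooth + lam_sparse * sparse
    return {"bce": bce, "rank": rank, "smooth": smooth,
            "sparse": sparse, "total": total}


# Usage: an abnormal video with a short two-segment event, and a normal video.
abn = [0.1] * 32
abn[10] = abn[11] = 0.9
nor = [0.1] * 32
print(mil_loss(abn, nor))
```

In this sketch the smoothness and sparsity terms are applied only to the abnormal video, since that is where a short, temporally coherent event is expected; whether the paper also regularizes normal videos is not stated in the abstract.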