What question did this study set out to answer?

The study aims to enhance video anomaly detection under weak supervision to accurately identify abnormal segments in lengthy untrimmed videos.

March 8, 2026 · Open Access

Spatio-Temporal Capsule Networks for Weakly Supervised Surveillance Video Anomaly Detection

Key Points

The study aims to enhance video anomaly detection under weak supervision to accurately identify abnormal segments in lengthy untrimmed videos.
Introduced ST-CapsNet, a spatio-temporal capsule network.
Each video is divided into 32 segments, and each segment is encoded with 512-dimensional 3D CNN features.
Primary capsules encode segment patterns as vectors; temporal capsules are formed through dynamic routing over time.
Employed a multiple-instance learning model with bag-level binary cross-entropy loss.
Implemented smoothness and sparsity regularization for temporal consistency.
ST-CapsNet outperformed strong baseline models in weakly supervised FAST split experiments on the UCF-Crime dataset.
Capsule routing proved to be an effective component of temporal reasoning for anomaly detection.
Demonstrated improved localization of anomalies using only video-level supervision.
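
The routing step in the key points can be illustrated with a minimal sketch. It assumes the standard routing-by-agreement procedure for capsule networks, applied across the 32 segments; the source does not state which routing variant, how many temporal capsules, or how many iterations ST-CapsNet uses, so the function names, the three iterations, and the omission of learned transformation matrices are all illustrative assumptions.

```python
import math

def squash(s, eps=1e-9):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    n2 = sum(x * x for x in s)
    scale = n2 / (1.0 + n2) / math.sqrt(n2 + eps)
    return [scale * x for x in s]

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [x / z for x in e]

def route_over_time(u_hat, iters=3):
    """Routing-by-agreement over segments (hypothetical helper).

    u_hat[t][k] is segment t's prediction vector for temporal capsule k.
    In a full model these predictions would come from learned transforms
    of the primary capsules; here they are taken as given.
    Returns (temporal capsule vectors, coupling coefficients).
    """
    T, K, D = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * K for _ in range(T)]  # routing logits, one per (segment, capsule)
    for it in range(iters):
        # Each segment distributes its vote across the temporal capsules.
        c = [softmax(row) for row in b]
        v = []
        for k in range(K):
            s = [sum(c[t][k] * u_hat[t][k][d] for t in range(T)) for d in range(D)]
            v.append(squash(s))
        if it < iters - 1:
            # Raise the logit where a segment's prediction agrees with the capsule.
            for t in range(T):
                for k in range(K):
                    b[t][k] += sum(p * q for p, q in zip(u_hat[t][k], v[k]))
    return v, c
```

In this sketch, segments whose predictions point in the same direction reinforce one another, so their coupling to the shared temporal capsule grows over the iterations. That is the sense in which related abnormal segments can support a common event representation.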

Abstract

Real surveillance systems require weakly supervised video anomaly detection because long untrimmed videos do not always have accurate temporal labels. A model must classify each video as normal or abnormal and also localize sparse anomalous segments using only video-level supervision. In this paper, we introduce ST-CapsNet, a spatio-temporal capsule network that improves weakly supervised anomaly localization through structured representations and temporal agreement. Each video is divided into 32 segments, and each segment is encoded with 512-dimensional 3D CNN (Convolutional Neural Network) features. Primary capsules encode segment patterns as vectors, and temporal capsules are formed by dynamic routing over time, so that related abnormal segments support a common event representation. Training follows a multiple-instance learning framework that combines a bag-level BCE (Binary Cross-Entropy) loss, a ranking loss for abnormal-normal separation, and smoothness and sparsity regularization that encourage temporal consistency and sparse event behavior. Experiments on the weakly supervised FAST (Focused and Accelerated Subset Training) split of UCF-Crime show that ST-CapsNet outperforms strong baselines. The findings indicate that capsule routing is an effective component of temporal reasoning for weakly supervised surveillance anomaly detection.
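
The training objective described in the abstract can be sketched as follows, under stated assumptions. The abstract names the four terms but not how segment scores are pooled into a bag score, the ranking margin, or the term weights. The max pooling, hinge margin of 1.0, and `lam_*` values below are therefore illustrative choices, and `mil_loss` is a hypothetical name.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def mil_loss(abn_scores, norm_scores, margin=1.0,
             lam_rank=1.0, lam_smooth=8e-5, lam_sparse=8e-5):
    """Weakly supervised loss for one abnormal and one normal video (bag).

    abn_scores / norm_scores: per-segment anomaly scores in [0, 1],
    e.g. 32 values per video. All weights are assumed, not from the source.
    """
    # Assumption: a bag is scored by its highest-scoring segment.
    bag_abn = max(abn_scores)
    bag_norm = max(norm_scores)

    # Bag-level BCE: the abnormal video is labeled 1, the normal video 0.
    bce_term = bce(bag_abn, 1.0) + bce(bag_norm, 0.0)

    # Ranking hinge: the abnormal bag should outscore the normal bag by a margin.
    rank_term = max(0.0, margin - bag_abn + bag_norm)

    # Smoothness: penalize abrupt score changes between adjacent segments.
    smooth_term = sum((a - b) ** 2 for a, b in zip(abn_scores, abn_scores[1:]))

    # Sparsity: anomalies should occupy only a few segments.
    sparse_term = sum(abn_scores)

    return (bce_term + lam_rank * rank_term
            + lam_smooth * smooth_term + lam_sparse * sparse_term)
```

Only video-level labels enter this loss: no term refers to which segment is anomalous, so any localization has to emerge from the pooling and the two regularizers.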
