What question did this study set out to answer?

The research aims to develop a lightweight framework for detecting violence in surveillance videos within resource-constrained environments.

May 9, 2026 · Open Access

Lightweight Temporal Violence Detection Using YOLOv7-Tiny and Bi-GRU for Resource-Constrained Environments

Key Points

The research aims to develop a lightweight framework for detecting violence in surveillance videos within resource-constrained environments.
Combined YOLOv7-Tiny and Bi-GRU for violence detection.
Utilized Adaptive Frame Control for real-time data processing with low resource usage.
Employed mixed-precision training for efficiency in training time and memory consumption.
Achieved 83.02% accuracy and 83.34% precision on the UCF-Crime dataset.
Model size is 65.3MB with 16.33M parameters, consuming approximately 404MB of GPU memory.
Real-time processing of video at 19.75 ms/sequence and an AUC of 80.14%.
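
As a rough sanity check on the figures above, the sketch below relates the parameter count to the model size and converts the per-sequence time into a throughput. It assumes, and the summary does not state, that weights are stored as 32-bit floats and that 1 MB means 10^6 bytes.

```python
# Figures reported in the key points.
PARAMS = 16.33e6          # parameter count
MODEL_SIZE_MB = 65.3      # reported model size
MS_PER_SEQUENCE = 19.75   # reported processing time per sequence

# Assumptions (not stated in the source): 4 bytes per parameter, 1 MB = 10**6 bytes.
BYTES_PER_PARAM = 4
estimated_size_mb = PARAMS * BYTES_PER_PARAM / 1e6

# Sequences processed per second if inference runs back to back.
sequences_per_second = 1000.0 / MS_PER_SEQUENCE

print(f"estimated size: {estimated_size_mb:.2f} MB (reported: {MODEL_SIZE_MB} MB)")
print(f"throughput: {sequences_per_second:.1f} sequences/s")
```

Under these assumptions the estimate is about 65.32 MB, which agrees with the reported 65.3 MB, and the throughput is about 50.6 sequences per second.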

Abstract

• A lightweight temporal violence detection framework was designed by combining YOLOv7-Tiny and Bi-GRU. It achieved 83.02% accuracy, 83.34% precision, and an 82.91% F1-score on the UCF-Crime dataset.
• Adaptive Frame Control enabled real-time processing in resource-limited environments without adding latency.
• Mixed-precision training and automatic optimization reduced training time by 38% and lowered memory consumption during model development.
• The proposed framework achieves a robust trade-off between recognition accuracy and efficiency, with a 65.3 MB model size, 16.33M parameters, and approximately 404 MB peak GPU memory.
• With an AUC of 80.14% and a processing time of 19.75 ms/sequence (1.23 ms/frame), the model is suitable for real-time surveillance applications even in constrained environments.

Surveillance video plays a crucial role in public safety, which makes real-time violence detection in surveillance footage important. However, existing methods rely on heavy models that do not fit in resource-constrained edge environments. To address the trade-off between detection performance and computational efficiency, we present a lightweight temporal violence detection framework that uses high-level features at low computational cost. The novelty of our work is the combination of a hybrid spatial–temporal architecture with an Adaptive Frame Control (AFC) mechanism that dynamically adjusts the input frame rate to keep frame processing stable over time without latency accumulation. Our method uses a lightweight YOLOv7-Tiny as the spatial feature extractor and a bidirectional Gated Recurrent Unit (Bi-GRU) with a frame-level attention mechanism for temporal feature extraction.
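
The abstract does not describe how Adaptive Frame Control is implemented. To illustrate the general idea of adjusting the input frame rate so that latency does not accumulate, here is a minimal sketch of one possible approach: a stride-based controller that processes every `stride`-th frame and widens the stride when the smoothed per-frame cost exceeds the frame interval. The class name, parameters, and the 40 ms / 60 ms figures are all hypothetical and are not taken from the paper.

```python
import math


class AdaptiveFrameController:
    """Hypothetical stride controller: process every `stride`-th incoming frame."""

    def __init__(self, frame_interval_ms, max_stride=8, smoothing=0.5):
        self.frame_interval_ms = frame_interval_ms  # time between incoming frames
        self.max_stride = max_stride
        self.smoothing = smoothing                  # weight of the newest measurement
        self.avg_cost_ms = None                     # smoothed processing cost per frame
        self.stride = 1

    def update(self, cost_ms):
        """Record the cost of the last processed frame and return the new stride."""
        if self.avg_cost_ms is None:
            self.avg_cost_ms = cost_ms
        else:
            a = self.smoothing
            self.avg_cost_ms = a * cost_ms + (1 - a) * self.avg_cost_ms
        # Smallest stride whose time budget covers the smoothed cost.
        needed = math.ceil(self.avg_cost_ms / self.frame_interval_ms)
        self.stride = max(1, min(self.max_stride, needed))
        return self.stride


def simulate(costs_ms, frame_interval_ms, adaptive):
    """Return how far (in ms) processing lags the live stream at the end."""
    ctrl = AdaptiveFrameController(frame_interval_ms)
    backlog_ms = 0.0
    for cost in costs_ms:
        stride = ctrl.stride if adaptive else 1
        budget = stride * frame_interval_ms  # time until the next frame we process
        backlog_ms = max(0.0, backlog_ms + cost - budget)
        ctrl.update(cost)
    return backlog_ms


# Hypothetical workload: frames arrive every 40 ms, each takes 60 ms to process.
costs = [60.0] * 100
print("fixed rate backlog (ms):", simulate(costs, 40.0, adaptive=False))
print("adaptive backlog (ms):", simulate(costs, 40.0, adaptive=True))
```

In this toy workload the fixed-rate loop falls 20 ms further behind on every frame, while the adaptive loop moves to a stride of 2 after the first frame and clears its backlog. A real controller would also need to decide how skipped frames affect the fixed-length sequences passed to the temporal model, which this sketch ignores.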
The model is evaluated on a publicly available subset of UCF-Crime, in which scenes are classified as violent or non-violent, using a balanced sample of 280 videos. Experimental results show that the approach achieves an accuracy of 83.02%, a precision of 83.34%, a recall of 83.02%, an F1-score of 82.91%, and an AUC of 80.14%, while consuming only ∼404 MB of GPU memory with a 65.3 MB model size. Finally, mixed-precision training shortens training time by 38%. These results indicate that the framework is suitable for real-time surveillance applications at the edge.
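
The abstract does not say how precision, recall, and F1 are averaged across the two classes. The reported recall equals the reported accuracy (both 83.02%), which would be consistent with support-weighted averaging (or macro averaging on balanced classes), since under that scheme recall always equals accuracy. The sketch below assumes that scheme and uses made-up confusion-matrix counts, not the paper's results, to show that the averaged F1 need not equal the harmonic mean of the averaged precision and recall.

```python
def weighted_metrics(confusion):
    """Support-weighted metrics; confusion[i][j] = true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for c in range(n_classes):
        tp = confusion[c][c]
        support = sum(confusion[c])                               # true samples of class c
        predicted = sum(confusion[i][c] for i in range(n_classes))  # predicted as class c
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Made-up counts for illustration only (rows: true class, columns: predicted class).
confusion = [[40, 10],
             [6, 44]]
acc, prec, rec, f1 = weighted_metrics(confusion)
harmonic = 2 * prec * rec / (prec + rec)

print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
print(f"harmonic mean of averaged precision and recall={harmonic:.4f}")
```

With these illustrative counts, the weighted recall matches the accuracy exactly, and the averaged F1 comes out slightly below the harmonic mean of the averaged precision and recall, the same ordering seen in the reported figures.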

Cite This Study

Shishir et al. studied this question.

synapsesocial.com/papers/69fececcb9154b0b828760ca https://doi.org/10.1016/j.sasc.2026.200494
