• A lightweight temporal violence detection framework combining YOLOv7-Tiny and Bi-GRU achieved 83.02% accuracy, 83.34% precision, and an 82.91% F1-score on the UCF-Crime dataset.
• Adaptive Frame Control enabled real-time processing in resource-limited environments without adding latency.
• Mixed-precision training and automatic optimization reduced training time by 38% and also reduced memory consumption during model development.
• The proposed framework achieves a robust trade-off between recognition accuracy and efficiency, with a 65.3 MB model size, 16.33M parameters, and approximately 404 MB peak GPU memory.
• With an AUC of 80.14% and a processing time of 19.75 ms/sequence (1.23 ms/frame), the model is suitable for real-time surveillance applications, even in constrained environments.

Surveillance video plays a crucial role in public safety, which makes real-time violence detection in surveillance footage an important task. However, existing methods rely on heavy models that do not fit resource-constrained edge environments. To address the trade-off between detection performance and computational efficiency, we present a lightweight temporal violence detection framework that extracts high-level features at low computational cost. The novelty of our work is the combination of a hybrid spatial–temporal architecture with an Adaptive Frame Control (AFC) mechanism that dynamically adjusts the input frame rate to keep frame processing stable over time without latency accumulation. Our method uses a lightweight YOLOv7-Tiny as the spatial feature extractor and a bidirectional Gated Recurrent Unit (Bi-GRU) with a frame-level attention mechanism for temporal feature extraction.
The model is evaluated on a balanced subset of 280 videos from the publicly available UCF-Crime dataset, in which each scene is classified as violence or non-violence. Experimental results show that the approach achieves an accuracy of 83.02%, a precision of 83.34%, a recall of 83.02%, an F1-score of 82.91%, and an AUC of 80.14%, while consuming only ∼404 MB of GPU memory with a 65.3 MB model size. In addition, mixed-precision training shortens training time by 38%. These results indicate that the framework is suitable for real-time surveillance applications at the edge.
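
The abstract does not spell out the rule that Adaptive Frame Control uses, so the following is only a minimal sketch of one way an input frame rate could be adjusted to avoid latency accumulation. It assumes a stride-based policy: when the measured per-frame processing time exceeds the interval between incoming frames, every `stride`-th frame is processed instead of every frame. The class `AdaptiveFrameController`, the function `select_frames`, and all parameters (`source_fps`, `min_stride`, `max_stride`) are hypothetical names introduced for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class AdaptiveFrameController:
    """Hypothetical stride-based frame controller (illustrative, not the paper's AFC)."""

    source_fps: float      # assumed frame rate of the incoming video stream
    min_stride: int = 1    # process every frame at best
    max_stride: int = 8    # assumed upper bound on how many frames may be skipped
    stride: int = 1

    def update(self, processing_ms_per_frame: float) -> int:
        """Choose the smallest stride whose time budget covers the measured cost.

        With stride s, one frame is processed for every s frames that arrive,
        so the budget per processed frame is s * (1000 / source_fps) ms.
        """
        needed = math.ceil(processing_ms_per_frame * self.source_fps / 1000.0)
        self.stride = max(self.min_stride, min(self.max_stride, needed))
        return self.stride


def select_frames(num_frames: int, stride: int) -> list[int]:
    """Return the indices of the frames that would be processed at this stride."""
    return list(range(0, num_frames, stride))


if __name__ == "__main__":
    controller = AdaptiveFrameController(source_fps=30.0)
    # 1.23 ms/frame is the per-frame time reported above; 50 ms is a made-up slow case.
    for cost_ms in (1.23, 50.0):
        stride = controller.update(cost_ms)
        print(cost_ms, stride, select_frames(10, stride))
```

Under this assumed policy, the reported 1.23 ms/frame fits well inside the roughly 33 ms interval of a 30 fps stream, so no frames would be skipped; frames are dropped only when processing falls behind the stream.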
Shishir et al. (Fri,) studied this question.