What question did this study set out to answer?

The research aims to address the challenges in human action recognition by developing an efficient model that balances accuracy and computational cost.

July 2, 2026 · Open Access

Efficient spatio-temporal modeling for human action recognition from RGB streams using unitary temporal encoding and adaptive consistency refinement

Key Points

Integrated EfficientNet-B2 for spatial feature extraction.
Utilized a Unitary Temporal Encoder (UTE) to model long-range temporal dependencies.
Employed an Adaptive Temporal Consistency Module (ATCM) to improve local temporal consistency.
Achieved a Top-1 accuracy of 97.10% on UCF101 and 87% on HMDB51.
Maintained an inference time of 9.5 ms per 16-frame video clip.
Model demonstrated consistent performance improvements in ablation studies.
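
For readers unfamiliar with the metric quoted above, Top-1 accuracy is the fraction of clips whose highest-scoring predicted class matches the ground-truth label. The following is a minimal sketch of that standard definition on toy data; the function name `top1_accuracy` and the scores are illustrative assumptions, not taken from the paper.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label.

    scores: list of per-class score lists, one per clip (toy data here).
    labels: list of integer ground-truth class indices.
    """
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the Top-1 prediction.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Hypothetical scores for 4 clips over 3 classes (not real model output).
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.5, 0.3],
]
toy_labels = [0, 1, 2, 1]

print(top1_accuracy(toy_scores, toy_labels))  # prints 0.75
```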

Abstract

Human Action Recognition (HAR) remains difficult because of the intricate temporal interactions, redundant frames, and subtle visual variations that often distinguish similar actions. Many current approaches rely on transformer or multimodal architectures, which are computationally costly and unsuitable for real-time or resource-constrained systems. To overcome these shortcomings, we propose a novel, lightweight HAR model that integrates spatial, temporal, and consistency modeling in an efficient architecture. The model combines EfficientNet-B2 for efficient spatial feature extraction, a Unitary Temporal Encoder (UTE) to model long-range temporal dependencies, and an Adaptive Temporal Consistency Module (ATCM) to improve local temporal consistency. The proposed system is trained and tested on the UCF101 and HMDB51 datasets, achieving Top-1 accuracies of 97.10% and 87%, respectively, using RGB input only. These results are competitive among RGB-based HAR methods while maintaining low computational cost. With 9.3 million parameters and an inference time of 9.5 ms per 16-frame video clip, the model is both accurate and efficient, making it suitable for real-time and edge inference. Ablation studies further demonstrate that the proposed components contribute consistent performance improvements across the evaluated benchmarks.
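
As a rough sanity check on the real-time claim, the reported latency can be converted into throughput. The sketch below assumes clips are processed one at a time and back to back, and that the 9.5 ms figure covers model inference only; the abstract does not state the hardware, batch size, or whether decoding and preprocessing are included, so the result is an upper bound under those assumptions. The function name `throughput` is illustrative.

```python
def throughput(latency_ms: float, frames_per_clip: int) -> tuple[float, float]:
    """Convert per-clip latency into clips/s and frames/s.

    Assumes sequential, back-to-back inference with no other overhead.
    """
    clips_per_second = 1000.0 / latency_ms
    frames_per_second = clips_per_second * frames_per_clip
    return clips_per_second, frames_per_second


# Figures reported in the abstract: 9.5 ms per 16-frame clip.
clips_per_s, frames_per_s = throughput(9.5, 16)
print(f"{clips_per_s:.1f} clips/s, {frames_per_s:.0f} frames/s")
# prints: 105.3 clips/s, 1684 frames/s
```

Under these assumptions the model processes frames far faster than a typical 30 fps video stream produces them, which is consistent with the abstract's description of the model as suitable for real-time use.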
