Human Action Recognition (HAR) remains difficult because of the intricate temporal interactions, redundant frames, and subtle visual differences that characterize similar actions. Many current approaches rely on transformer or multimodal architectures, which are computationally expensive and poorly suited to real-time or resource-constrained systems. To overcome these shortcomings, we propose a novel, lightweight HAR model that integrates spatial, temporal, and consistency modeling in a single efficient architecture. The model combines EfficientNet-B2 for efficient spatial feature extraction, a Unitary Temporal Encoder (UTE) for modeling long-range temporal dependencies, and an Adaptive Temporal Consistency Module (ATCM) for improving local temporal consistency. Trained and tested on the UCF101 and HMDB51 datasets, the proposed system achieves Top-1 accuracies of 97.10% and 87%, respectively, using RGB input only, which is competitive among RGB-based HAR methods while maintaining low computational cost. With 9.3 million parameters and an inference time of 9.5 ms per 16-frame video clip, the model is both accurate and efficient, making it suitable for real-time and edge inference. Ablation studies further demonstrate that each proposed component contributes consistent performance improvements across the evaluated benchmarks.
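To put the reported latency in context, the short sketch below converts 9.5 ms per 16-frame clip into throughput figures. It is a back-of-the-envelope calculation, not a measurement: it assumes clips are processed back to back without overlap or batching, ignores decoding and preprocessing time, and uses a 30 fps camera rate that is an illustrative assumption rather than a figure from the abstract. The function names are hypothetical.

```python
# Figures quoted in the abstract.
LATENCY_MS_PER_CLIP = 9.5   # inference time for one clip
FRAMES_PER_CLIP = 16        # frames in each clip


def throughput(latency_ms, frames_per_clip):
    """Return (clips per second, frames per second) for back-to-back clips."""
    clips_per_second = 1000.0 / latency_ms
    frames_per_second = clips_per_second * frames_per_clip
    return clips_per_second, frames_per_second


def realtime_headroom(frames_per_second, camera_fps):
    """Ratio of processing rate to capture rate; above 1.0 keeps up with the camera."""
    return frames_per_second / camera_fps


clips_s, frames_s = throughput(LATENCY_MS_PER_CLIP, FRAMES_PER_CLIP)
# 30 fps is an assumed camera rate, not a value from the abstract.
headroom = realtime_headroom(frames_s, camera_fps=30.0)
print(f"{clips_s:.1f} clips/s, {frames_s:.0f} frames/s, {headroom:.1f}x a 30 fps stream")
```

Under these assumptions the model processes roughly 105 clips per second, or about 1,684 frames per second. The actual margin on an edge device depends on hardware that the abstract does not specify.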
Majid et al. (Tue,) studied this question.