Human Activity Recognition (HAR) aims to develop algorithms and systems capable of comprehending and interpreting human behavior in a variety of contexts. The capacity to automatically identify and categorize human movements from video data has the potential to transform a number of sectors, including robotics, fall detection, elderly healthcare, surveillance, and crime detection, by offering insightful information and increasing productivity. However, prior research frequently struggles with small variations between actions, cluttered environments, and action-background bias. Moreover, the need for a keyframe extraction method that reduces computational complexity has not been met. To address these issues, we introduce a chromatic-feature-based keyframe extraction method that uses improved K-means clustering to heuristically select keyframes from a video before the frame sequences are passed to our proposed framework. Our study primarily concentrates on a vision-transformer-based DeiT-Echo-State-RNN architecture that leverages a spatio-temporal attention mechanism. To explore spatio-temporal attention, we propose a Modified CBAM, which improves the overall model's performance, as illustrated in the ablation study. On the benchmark dataset UCF-11, the model attained a state-of-the-art accuracy of 98.69%. It also demonstrated its resilience and generalisability on further benchmark datasets, namely BAR, UCF Sports, and KTH, outperforming current methods with accuracies of 70.03%, 98.33%, and 99.16%, respectively. The proposed model demonstrates computational efficiency, with 307.68 ms inference time, 12.14 GMACs, 24.27 GFLOPs, and 86.88M trainable parameters.
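The keyframe extraction step described above can be illustrated with a minimal sketch. It is not the authors' method: it assumes the "chromatic feature" is a normalized per-channel color histogram, uses plain K-means rather than the improved variant mentioned in the abstract, and picks the frame closest to each centroid as the keyframe. The names `color_histogram` and `select_keyframes`, the bin count, and the evenly spaced initialization are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Frame = Sequence[Pixel]  # assumption: a frame flattened to (R, G, B) pixels


def color_histogram(frame: Frame, bins: int = 4) -> List[float]:
    """Normalized per-channel color histogram with 3 * bins entries."""
    hist = [0.0] * (3 * bins)
    for pixel in frame:
        for channel, value in enumerate(pixel):
            # Map an 8-bit intensity to one of `bins` equal-width buckets.
            bucket = min(value * bins // 256, bins - 1)
            hist[channel * bins + bucket] += 1.0
    total = float(len(frame)) or 1.0
    return [h / total for h in hist]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_keyframes(frames: Sequence[Frame], k: int,
                     bins: int = 4, max_iter: int = 50) -> List[int]:
    """Cluster frames by color histogram; return one frame index per cluster."""
    if not 1 <= k <= len(frames):
        raise ValueError("k must be between 1 and the number of frames")
    features = [color_histogram(f, bins) for f in frames]
    n = len(features)
    # Hypothetical deterministic init: centroids start at evenly spaced frames.
    centroids = [list(features[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    assignment = [-1] * n
    for _ in range(max_iter):
        # Assignment step: attach each frame to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(feat, centroids[c]))
            for feat in features
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [features[i] for i in range(n) if assignment[i] == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    keyframes = set()
    for c in range(k):
        members = [i for i in range(n) if assignment[i] == c]
        if members:
            keyframes.add(min(
                members,
                key=lambda i: squared_distance(features[i], centroids[c])))
    return sorted(keyframes)  # keep temporal order for the downstream model
```

In a pipeline of this kind, `select_keyframes(frames, k)` would be called once per video, and only the returned frame indices would be passed on to the recognition model, which reduces the number of frames the model has to process.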
Hassan et al. (Fri,) studied this question.