What question did this study set out to answer?

This research aims to enhance human activity recognition by developing a vision transformer model with optimized keyframe extraction.

February 8, 2026

Attention-driven vision transformer for human activity recognition with chromatic feature-based keyframe extraction

Key Points

Used improved K-means clustering on chromatic features to extract keyframes from video data.
Proposed a vision transformer-based DeiT-Echo-State-RNN architecture.
Explored spatio-temporal attention through a Modified CBAM, which improved performance in the ablation study.
Evaluated the model on the UCF-11 benchmark dataset.
Achieved a state-of-the-art accuracy of 98.69% on UCF-11.
Demonstrated generalisability with accuracies of 70.03% on BAR, 98.33% on UCF Sports, and 99.16% on KTH.
Reported a computational cost of 307.68 ms inference time, 12.14 GMACs, 24.27 GFLOPs, and 86.88M trainable parameters.
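
The GMACs and GFLOPs figures can be cross-checked against each other. The short sketch below assumes the common convention that one multiply-accumulate (MAC) counts as two floating-point operations; the source does not state which convention was used, and the helper name `macs_to_flops` is purely illustrative.

```python
def macs_to_flops(gmacs: float) -> float:
    """Convert GMACs to GFLOPs, ASSUMING 1 MAC = 1 multiply + 1 add = 2 FLOPs."""
    return 2.0 * gmacs


reported_gmacs = 12.14   # from the study
reported_gflops = 24.27  # from the study

estimated_gflops = macs_to_flops(reported_gmacs)
gap = abs(estimated_gflops - reported_gflops)

# Under the 2-FLOPs-per-MAC assumption the two reported figures agree
# to within rounding of the second decimal place.
print(f"estimated GFLOPs: {estimated_gflops:.2f}, gap to reported: {gap:.2f}")
```

Under that assumption the two reported numbers describe the same workload, so they should not be read as two independent measurements.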

Abstract

In the current AI landscape, Human Activity Recognition (HAR) aims to develop algorithms and systems capable of comprehending and interpreting human behavior in a variety of contexts. By offering insightful information and increasing productivity, the capacity to automatically identify and categorize human movements from video data has the potential to transform a number of sectors, including robotics, fall detection, elderly medical care, surveillance, and crime detection. Even so, prior research frequently struggles with small variations between actions, cluttered environments, and action-background bias. Moreover, the need for a keyframe extraction method that reduces computational complexity has not been met. To address these issues, we introduce a chromatic feature-based keyframe extraction method that uses improved K-means clustering to select keyframes from a video heuristically before passing the frame sequences to our proposed framework. Our study primarily concentrates on proposing a vision transformer-based DeiT-Echo-State-RNN architecture that leverages a spatio-temporal attention mechanism. In exploring spatio-temporal attention, we propose a Modified CBAM, which improves the overall model's performance, as illustrated in the ablation study. For evaluation, we used the benchmark dataset UCF-11 to assess the model's performance, on which it attained a state-of-the-art accuracy of 98.69%. The model demonstrated its resilience and generalisability by beating current methods across benchmark datasets, with accuracies of 98.69%, 70.03%, 98.33%, and 99.16% on UCF-11, BAR, UCF Sports, and KTH, respectively. The proposed model demonstrates computational efficiency with 307.68 ms inference time, 12.14 GMACs, 24.27 GFLOPs, and 86.88M trainable parameters.
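
To make the keyframe extraction idea concrete, here is a minimal sketch of clustering-based keyframe selection. It is not the authors' implementation: the abstract does not specify the chromatic features or how K-means was improved, so the sketch assumes per-channel color histograms as features, plain K-means, and cluster centers initialized from evenly spaced frames. All function names (`color_histogram`, `kmeans`, `extract_keyframes`) are hypothetical.

```python
import numpy as np


def color_histogram(frame, bins=8):
    """Per-channel histogram of an H x W x 3 uint8 frame, normalized to sum to 1.

    ASSUMPTION: a stand-in for the paper's unspecified chromatic features.
    """
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    feat = np.concatenate(hists).astype(np.float64)
    return feat / feat.sum()


def assign(features, centers):
    """Index of the nearest center (Euclidean distance) for every feature."""
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(features, k, iters=50):
    """Plain K-means (not the paper's improved variant).

    ASSUMPTION: centers start from k frames evenly spaced in time.
    """
    init = np.linspace(0, len(features) - 1, k).astype(int)
    centers = features[init].copy()
    for _ in range(iters):
        labels = assign(features, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = features[labels == j]
            if len(members):  # leave an empty cluster's center unchanged
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(features, centers), centers


def extract_keyframes(frames, k):
    """Return sorted indices of the frame closest to each cluster center."""
    features = np.stack([color_histogram(f) for f in frames])
    labels, centers = kmeans(features, k)
    keyframes = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            continue
        d = np.linalg.norm(features[idx] - centers[j], axis=1)
        keyframes.append(int(idx[d.argmin()]))
    return sorted(keyframes)


def solid(color, n=10, size=4):
    """n identical size x size frames filled with one RGB color (toy data)."""
    return [np.full((size, size, 3), color, dtype=np.uint8) for _ in range(n)]


# Toy "video": 10 red, then 10 green, then 10 blue frames.
video = solid((255, 0, 0)) + solid((0, 255, 0)) + solid((0, 0, 255))
keys = extract_keyframes(video, k=3)
print(keys)
```

On this toy clip each color segment forms its own cluster, so one representative frame per segment is kept and only those frames would be passed on to the recognition model. Real footage would need a more careful choice of features and of `k`.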
