What question did this study set out to answer?

To improve video action recognition by addressing challenges in accuracy and computational efficiency.

March 25, 2026 | Open Access

ETMamba: An Effective Temporal Model for Video Action Recognition

Key Points

Developed ETMamba architecture based on Mamba
Implemented Spatiotemporal Feature Preservation module to retain correlations
Introduced Efficient Bidirectional Sharing strategy for temporal dependencies
Integrated Spatiotemporal Collaborative Modulation for dynamic information
Achieved 88.3% recognition accuracy on Kinetics-400
Achieved 74.6% recognition accuracy on Something-Something V2
Achieved 75.7% recognition accuracy on HMDB-51
Achieved 98.1% recognition accuracy on the Breakfast dataset
Maintained low to medium computational complexity

Abstract

Video action recognition faces persistent challenges in balancing accuracy with computational efficiency. While state space models such as Mamba offer the advantage of linear complexity, they are inefficient at capturing critical spatiotemporal dependencies within video data. To address this core limitation, this paper proposes ETMamba, an enhanced architecture built upon the Mamba baseline. ETMamba achieves its performance gains through three core modules: (1) the Spatiotemporal Feature Preservation module retains the complete original spatiotemporal correlations before data flattening, addressing the loss of spatiotemporal features; (2) the Efficient Bidirectional Sharing strategy models bidirectional temporal dependencies, enhancing key temporal dynamic information; and (3) the Spatiotemporal Collaborative Modulation mechanism combines global temporal and local spatial information to jointly capture long- and short-term dependencies and fine-grained features. We conduct experiments on multiple benchmark datasets, achieving recognition accuracies of 88.3%, 74.6%, 75.7%, and 98.1% on the Kinetics-400, Something-Something V2, HMDB-51, and Breakfast datasets, respectively, while maintaining low to medium computational complexity.
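The abstract does not give implementation details, so the following is only a toy illustration of the general idea behind a bidirectional scan with shared parameters, not ETMamba's actual module. It assumes a scalar linear state-space recurrence with fixed, hypothetical coefficients `a`, `b`, and `c`; real Mamba-style models use learned, input-dependent parameters over high-dimensional features. The function names are invented for this sketch.

```python
def ssm_scan(xs, a, b, c):
    """Scalar linear state-space recurrence; cost is linear in len(xs)."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # state update: decay old state, inject new input
        ys.append(c * h)    # readout from the hidden state
    return ys


def bidirectional_shared_scan(xs, a=0.5, b=1.0, c=1.0):
    """Scan forward and backward with the SAME (a, b, c), then sum.

    Sharing one parameter set across both directions is the assumption
    being illustrated here; it is not taken from the paper.
    """
    forward = ssm_scan(xs, a, b, c)
    # Reverse the sequence, scan, then restore the original time order.
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# A single "event" in the first frame of a four-frame sequence.
frames = [1.0, 0.0, 0.0, 0.0]
print(bidirectional_shared_scan(frames))
```

With an impulse in the first frame, the forward pass spreads its influence to later frames with geometric decay, while the backward pass contributes only at the frame itself, since no later frames carry signal. Each position therefore sees context from both temporal directions, at a cost that stays linear in sequence length.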
