Video action recognition faces a persistent trade-off between accuracy and computational efficiency. State space models such as Mamba offer linear complexity, but they are inefficient at capturing critical spatiotemporal dependencies in video data. To address this limitation, this paper proposes ETMamba, an enhanced architecture built on the Mamba baseline. ETMamba achieves its performance gains through three core modules: (1) the Spatiotemporal Feature Preservation module retains the complete original spatiotemporal correlations before the data are flattened, addressing the loss of spatiotemporal features; (2) the Efficient Bidirectional Sharing strategy accurately models bidirectional temporal dependencies, enhancing key temporal dynamics; and (3) the Spatiotemporal Collaborative Modulation mechanism combines global temporal and local spatial information to jointly capture long- and short-term dependencies and fine-grained features. In experiments on multiple benchmark datasets, ETMamba achieves recognition accuracies of 88.3%, 74.6%, 75.7%, and 98.1% on Kinetics-400, Something-Something V2, HMDB-51, and Breakfast, respectively, while maintaining low to medium computational complexity.
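The abstract does not give implementation details, so the following is only a toy sketch of two ideas it names: flattening a video into a token sequence, and scanning that sequence in both directions with one shared set of parameters. The function names, the scalar recurrence, and the `decay` and `gain` values are all assumptions made for illustration; they are not ETMamba's modules or Mamba's actual selective scan.

```python
def flatten_video(video):
    """Flatten a nested [T][H][W] list of scalars into one token list, frame by frame."""
    return [pixel for frame in video for row in frame for pixel in row]


def linear_scan(tokens, decay, gain):
    """Toy linear recurrence h_t = decay * h_{t-1} + gain * x_t (one pass, linear in length)."""
    state = 0.0
    outputs = []
    for x in tokens:
        state = decay * state + gain * x
        outputs.append(state)
    return outputs


def bidirectional_shared_scan(tokens, decay=0.5, gain=1.0):
    """Run the same scan (same decay and gain) forward and backward, then sum the two."""
    forward = linear_scan(tokens, decay, gain)
    # Reverse the input, scan, then reverse the output so positions line up again.
    backward = linear_scan(tokens[::-1], decay, gain)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Hypothetical clip: 2 frames, each 1x2 pixels, used only for illustration.
clip = [[[1.0, 0.0]], [[0.0, 0.0]]]
tokens = flatten_video(clip)
mixed = bidirectional_shared_scan(tokens)
```

In this sketch the flattened sequence keeps no record of which tokens were spatial neighbours or belonged to the same frame, which is the kind of information loss the abstract says the Spatiotemporal Feature Preservation module is meant to avoid. Sharing `decay` and `gain` across both directions means the backward pass adds no extra parameters.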
Rundong Hong
Changji Wen
Patrick Sun
Electronics
University of Wisconsin–Madison
Jilin Agricultural University
Midea Group (China)
Hong et al. (Mon,) studied this question.
www.synapsesocial.com/papers/69c37adcb34aaaeb1a67cc20 — DOI: https://doi.org/10.3390/electronics15061338