What type of study is this?

This is an experimental study.

September 28, 2025

KeyMamba: keyframe-enhanced state space model for efficient temporal action detection

Key Points

KeyMamba targets action boundary detection in temporal action detection, where existing methods are less accurate than they are at classifying actions.
The method reaches an average mAP of 70.4 on THUMOS14 and 38.44 on ActivityNet-1.3.
The state-space model combines key frame features with a bidirectional Mamba block that captures global features.
A temporal deformable attention module extracts key frame features from video clips.
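
The mAP figures above depend on how closely predicted action boundaries overlap the ground-truth ones. The following is a minimal sketch of temporal IoU (tIoU), the overlap measure commonly used when scoring TAD predictions. The function names and the threshold list are illustrative assumptions and are not taken from the paper.

```python
def temporal_iou(pred, gt):
    """tIoU between two (start, end) segments, e.g. in seconds."""
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def thresholds_passed(pred, gt, thresholds):
    """Return the tIoU thresholds at which pred would count as a match for gt."""
    iou = temporal_iou(pred, gt)
    return [t for t in thresholds if iou >= t]


# Hypothetical example: the predicted segment starts four seconds too early.
loose = thresholds_passed((0.0, 8.0), (4.0, 8.0), [0.3, 0.5, 0.7])
```

A prediction with the right class but loose boundaries passes only the lower thresholds, which is why more precise boundaries raise mAP averaged over several thresholds.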

Abstract

Temporal action detection (TAD) is a challenging task in video understanding: it requires determining the semantic label and precise boundaries of each action instance in an untrimmed video. Over the years, a variety of networks have been proposed, including convolutional, graph-based, and transformer-based architectures, and applied effectively to TAD. Most methods identify the action category well; however, their accuracy in determining action boundaries is still insufficient. Because an action spans several consecutive frames of similar images, we propose selecting the key frames in the video sequence and enhancing the TAD representation with additional features extracted from them. We introduce KeyMamba, a learnable network for TAD based on a state-space model. The model applies a bidirectional Mamba block to capture global features efficiently. We also add a temporal deformable attention module to extract key frame features from video clips. These features carry information about motion changes and complement the global features, allowing action boundaries to be identified more accurately. In addition, to obtain higher-quality tokens in the spatial dimension, we add an attention mask before the bidirectional Mamba block encoder. Finally, we apply masking operations during the forward and backward scans within the bidirectional Mamba block to mitigate the impact of duplicate tokens. In our experiments, KeyMamba reaches an average mAP of 70.4 on THUMOS14 and 38.44 on ActivityNet-1.3.
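
The idea of masking during forward and backward scans can be illustrated with a toy example. The sketch below is not the Mamba selective scan and not the authors' implementation: it uses a fixed scalar recurrence h_t = decay * h_{t-1} + x_t as a stand-in, assumes `mask` simply flags which positions take part in the scan, and assumes the two directions are merged by summation. All names are hypothetical.

```python
def masked_scan(xs, mask, decay=0.5):
    """Scalar linear recurrence that skips masked-out positions."""
    h = 0.0
    out = []
    for x, keep in zip(xs, mask):
        if keep:
            h = decay * h + x   # update the running state
            out.append(h)
        else:
            out.append(0.0)     # masked token: state untouched, output zeroed
    return out


def bidirectional_masked_scan(xs, mask, decay=0.5):
    """Run the masked scan in both directions and sum the results (assumed merge)."""
    fwd = masked_scan(xs, mask, decay)
    bwd = masked_scan(xs[::-1], mask[::-1], decay)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


# Position 2 is masked, so its value (9.0) never enters either scan.
ys = bidirectional_masked_scan([1.0, 2.0, 9.0, 4.0], [True, True, False, True])
```

Because the masked position neither updates the state nor produces output, it cannot influence its neighbours in either direction, which is the effect the abstract attributes to masking within the bidirectional block.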
