March 3, 2026

Incorporating Temporal Dynamics and Intricate Spatial-Temporal Dependencies for ActionFormer

Key Points

The method improves average mean Average Precision (mAP) by 16.01% for inertial data and by 7.8% for visual data over the baseline ActionFormer.
The approach adapts ActionFormer, a Transformer-based model, to temporal sequential data by adding two dedicated modules, one for inertial data and one for visual data.
Experiments on the WEAR dataset demonstrate the model's robustness and effectiveness in detecting actions within video sequences.
The paper highlights the importance of addressing high temporal dynamics in inertial data and intricate spatial-temporal dependencies in visual data.

Abstract

ActionFormer is a Transformer-based model that identifies actions in video data and locates their start and end times. This paper optimizes ActionFormer for temporal sequential data by incorporating dedicated modules into its Transformer-based architecture. Two modules are developed: one for inertial data, which addresses high temporal dynamics, and one for visual data, which addresses intricate spatial-temporal dependencies. Experiments on the WEAR dataset show a substantial gain over the baseline ActionFormer, with a 16.01% improvement in average mAP for inertial data and a 7.8% improvement for visual data. Additional evaluations on benchmark datasets confirm the robustness and effectiveness of the proposed approach.
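
The abstract does not say how average mAP is computed. In temporal action localization it is commonly obtained by matching predicted segments to ground-truth segments with a temporal Intersection over Union (tIoU) score, computing mAP at several tIoU thresholds, and averaging the results. The sketch below illustrates that convention only. The function names `temporal_iou` and `average_map` are hypothetical, the numbers are made up for illustration, and none of it is taken from the paper's code or results.

```python
def temporal_iou(pred, gt):
    """tIoU of two (start, end) segments given in seconds."""
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def average_map(map_by_threshold):
    """Mean of per-threshold mAP values (assumed averaging convention)."""
    return sum(map_by_threshold.values()) / len(map_by_threshold)


# Illustrative values only, not results from the paper.
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))
avg = average_map({0.3: 0.5, 0.5: 0.3})
```

A predicted segment counts as correct at a given threshold only if its tIoU with a same-class ground-truth segment meets that threshold, so stricter thresholds reward more precise start and end times.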
