ActionFormer is a Transformer-based model that identifies actions in video and localizes their start and end times. This paper optimizes ActionFormer for temporal sequential data by incorporating dedicated modules into its Transformer-based architecture to improve performance. We develop two modules, one for inertial data and one for visual data, to address the high temporal dynamics of inertial data and the intricate spatial-temporal dependencies of visual data. Experiments on the WEAR dataset show substantial gains over the baseline ActionFormer: average mAP improves by 16.01% for inertial data and by 7.8% for visual data. Additional evaluations on benchmark datasets confirm the robustness and effectiveness of the proposed approach.
Zhao et al. studied this question.