What type of study is this?

This is a quantitative study.

September 23, 2025 | Open Access

JMM-TGT: Self-supervised 3D action recognition through joint motion masking and topology-guided transformer v1

Key Points

The JMM-TGT model improves action recognition accuracy by enhancing subtle joint movement perception.
Experiments show performance gains of 1.5% to 7.9% under different evaluation settings on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets.
Topology-guided adjustments to the attention mechanism help the model capture complex dynamic patterns in joint motion.
Self-supervised learning is applied to overcome the limitations of earlier methods that rely on a single motion feature.

Abstract

In 3D skeleton-based action recognition, research on self-supervised learning has focused primarily on spatio-temporal feature modeling. However, these methods rely heavily on modeling a single motion feature, which limits their ability to capture subtle motion variations and complex spatio-temporal relationships. As a result, the model's understanding of an action remains incomplete. To address this issue, this paper proposes the Joint Motion Masking with Topology-Guided Transformer model (JMM-TGT) for action recognition. First, a Joint Motion Masking strategy is applied to enhance the model's ability to perceive subtle joint movements. It generates masking probabilities by combining the differences and similarities in joint motion, and these probabilities guide the selection of joints to be masked at each time step. Second, in the transformer-based encoder module, the topological relationships between joints are introduced to adjust the attention mechanism, allowing the model to capture spatio-temporal dependencies and better understand the complex dynamic patterns of joint motion. To verify its performance, we compared JMM-TGT with mainstream action recognition models. The experiments show that JMM-TGT achieves performance improvements ranging from 1.5% to 7.9% under different evaluation settings on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets.
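
The two ideas in the abstract can be illustrated with a rough NumPy sketch. It is not the authors' implementation: the abstract does not say how the difference and similarity cues are combined or how topology enters the attention scores. The sketch assumes a weighted sum of motion magnitude and mean cosine similarity, followed by a softmax over joints, for the masking probabilities. It also assumes an additive adjacency bias on the attention logits. All function names and the parameters `alpha`, `tau`, and `beta` are hypothetical.

```python
import numpy as np


def motion_mask_probs(x, alpha=0.5, tau=1.0):
    """Per-frame masking probabilities for a skeleton sequence x of shape (T, J, C).

    Assumed scoring rule (not taken from the paper): a weighted sum of
    motion magnitude (difference cue) and mean cosine similarity between
    a joint's motion and that of all joints (similarity cue).
    """
    # Frame-to-frame displacement; the first frame gets zero motion.
    motion = np.diff(x, axis=0, prepend=x[:1])            # (T, J, C)
    magnitude = np.linalg.norm(motion, axis=-1)           # (T, J)

    # Cosine similarity between joint motion vectors within each frame.
    unit = motion / (magnitude[..., None] + 1e-8)         # (T, J, C)
    cosine = unit @ unit.transpose(0, 2, 1)               # (T, J, J)
    similarity = cosine.mean(axis=-1)                     # (T, J)

    score = alpha * magnitude + (1.0 - alpha) * similarity

    # Softmax over joints, computed stably, so each frame sums to 1.
    z = score / tau
    z = z - z.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def sample_joint_mask(probs, num_masked, rng):
    """Draw num_masked distinct joints per frame according to probs (T, J)."""
    T, J = probs.shape
    mask = np.zeros((T, J), dtype=bool)
    for t in range(T):
        idx = rng.choice(J, size=num_masked, replace=False, p=probs[t])
        mask[t, idx] = True
    return mask


def topology_biased_attention(q, k, v, adjacency, beta=1.0):
    """Single-head attention over J joints with an additive skeleton bias.

    q, k, v: (J, D) arrays; adjacency: (J, J) 0/1 skeleton graph.
    The additive bias is an assumed form of "topology guidance".
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    # Favor attention between physically connected joints (and self).
    logits = logits + beta * (adjacency + np.eye(adjacency.shape[0]))
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

For a sequence `x` of shape `(T, 25, 3)`, `sample_joint_mask(motion_mask_probs(x), 5, np.random.default_rng(0))` returns a boolean `(T, 25)` mask with five joints selected per frame. In a masked-reconstruction setup, the selected joints would be hidden from the encoder and predicted from the rest.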
