Human action recognition is a key research area in computer vision, where accurate recognition relies on effective modeling of both global and local spatiotemporal information. However, existing GCN-based methods often overemphasize the local topological connectivity of human skeletons. Moreover, their temporal modules fail to fully capture the evolution of action sequences, so that critical instantaneous information is obscured by global representations. To address these problems, we propose an integrated framework termed MADS-GCN. In the spatial modeling stage, we introduce two parallel streams: the Physical Stream uses the skeleton adjacency matrix to constrain convolution and capture global structural patterns, while the Topological Stream uses spatial attention to assign adaptive weights to joints, preserving discriminative local features. For temporal modeling, a channel-temporal attention mechanism adaptively refines the feature maps, and a bidirectional GRU then captures multi-scale temporal patterns. Extensive experiments on NTU RGB+D 60, Northwestern-UCLA, and our custom DanceBasic-Set demonstrate the effectiveness of MADS-GCN and indicate its applicability to dance action recognition scenarios.
Wang et al. (Thu,) studied this question.
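To make the two-stream spatial stage described above more concrete, the following is a minimal single-frame sketch in plain Python. It is not the MADS-GCN implementation: the function names, the row-normalized adjacency, the hand-written attention score (mean absolute feature value per joint), and the fusion by element-wise sum are all assumptions chosen for illustration. In the actual model the weights would be learned, and the temporal stage (channel-temporal attention and the bidirectional GRU) is omitted here.

```python
import math


def normalize_adjacency(adj):
    """Row-normalize A + I so each joint averages over itself and its neighbours."""
    n = len(adj)
    normalized = []
    for i in range(n):
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized


def physical_stream(x, adj_norm):
    """Adjacency-constrained aggregation: Y = A_norm @ X (no learned weights here)."""
    n, channels = len(x), len(x[0])
    return [
        [sum(adj_norm[i][j] * x[j][k] for j in range(n)) for k in range(channels)]
        for i in range(n)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topological_stream(x):
    """Scale each joint by an attention weight.

    Assumption: the score is the mean absolute feature value of the joint,
    a stand-in for the learned spatial attention in the paper.
    """
    scores = [sum(abs(v) for v in row) / len(row) for row in x]
    weights = softmax(scores)
    weighted = [[weights[i] * v for v in x[i]] for i in range(len(x))]
    return weighted, weights


def spatial_block(x, adj):
    """Run both streams in parallel and fuse them by element-wise sum (assumed fusion)."""
    phys = physical_stream(x, normalize_adjacency(adj))
    topo, weights = topological_stream(x)
    fused = [[p + t for p, t in zip(p_row, t_row)] for p_row, t_row in zip(phys, topo)]
    return fused, weights
```

Calling `spatial_block(x, adj)` with `x` as a list of per-joint feature vectors and `adj` as a 0/1 skeleton adjacency matrix returns the fused features together with the per-joint attention weights, which sum to 1.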