Enabling robots to perceive and understand human actions is a cornerstone of human–robot interaction. Skeleton data, which provides a structured and privacy-preserving representation, is particularly suitable for this domain. The core challenge lies in capturing both the topological relationships among human joints and their detailed spatiotemporal dynamics. Graph Convolutional Networks (GCNs) excel at modeling topological structure, and 3D Convolutional Neural Networks (3DCNNs) excel at capturing local motion patterns, but fusing these heterogeneous features remains inefficient. The inefficiency stems from the distribution discrepancy and dimensional mismatch between GCN-derived topological features (non-Euclidean, structure-focused) and 3DCNN-derived spatiotemporal features (Euclidean, motion-focused), which leads to information redundancy or feature conflict under traditional fusion strategies such as simple concatenation or summation. Inspired by the trend toward heterogeneous multimodal fusion for robust perception, this paper proposes a dual-branch architecture for skeleton-based action recognition. The model processes data in parallel: the PoseC3D branch extracts fine-grained spatiotemporal dynamics from 3D heatmap volumes, while the InfoGCN branch explores dynamic topological correlations between joints. To achieve deep fusion of these complementary modalities, we introduce a Multi-scale Contextual Attention module, a feature fusion module for GCN and 3DCNN features (hereinafter GC3D), that performs feature alignment and enhancement across the temporal, channel, and spatial dimensions. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, and a dedicated tennis action dataset demonstrate state-of-the-art performance. Specifically, the model achieves 94.3% (X-Sub) and 97.3% (X-View) accuracy on NTU RGB+D 60, and 89.3% (X-Sub) and 91.5% (X-Set) on NTU RGB+D 120. Ablation studies confirm the contribution of each branch and of the fusion module. This work validates that fusing topological and spatiotemporal features provides a robust and accurate solution for human action recognition, facilitating more intelligent and responsive robotic systems.
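To make the fusion problem concrete, the following is a minimal illustrative sketch of how two pooled feature vectors of different size and scale might be aligned and then combined with a learned per-channel gate instead of plain concatenation or summation. It is not the GC3D module described above: the function names (`standardize`, `project`, `gated_fusion`), the feature sizes, and the random weights are all assumptions made for illustration, and the sketch covers only a channel-wise gate, not the temporal or spatial dimensions.

```python
import math
import random


def standardize(v):
    """Rescale a vector to zero mean and unit variance (reduces scale mismatch)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    std = math.sqrt(var + 1e-8)
    return [(x - mean) / std for x in v]


def project(v, weights):
    """Linear projection; `weights` is a list of rows, each as long as `v`."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(gcn_feat, cnn_feat, w_gcn, w_cnn, w_gate):
    """Fuse two feature vectors of different sizes (hypothetical helper).

    1. Project each branch into a shared dimension and standardize it.
    2. Compute a per-channel gate in (0, 1) from both aligned vectors.
    3. Return the gated convex combination of the two aligned vectors.
    """
    g = standardize(project(gcn_feat, w_gcn))
    c = standardize(project(cnn_feat, w_cnn))
    gate = [sigmoid(z) for z in project(g + c, w_gate)]
    return [a * gi + (1.0 - a) * ci for a, gi, ci in zip(gate, g, c)]


# Toy example: a 4-d "topological" vector and a 6-d "spatiotemporal" vector
# are fused into a shared 3-d space. Weights are random stand-ins for
# parameters that would normally be learned.
rng = random.Random(0)
shared_dim = 3
gcn_feat = [0.2, 1.5, -0.7, 3.1]
cnn_feat = [10.0, 12.5, 9.8, 11.1, 14.2, 8.9]
w_gcn = [[rng.uniform(-1, 1) for _ in gcn_feat] for _ in range(shared_dim)]
w_cnn = [[rng.uniform(-1, 1) for _ in cnn_feat] for _ in range(shared_dim)]
w_gate = [[rng.uniform(-1, 1) for _ in range(2 * shared_dim)]
          for _ in range(shared_dim)]

fused = gated_fusion(gcn_feat, cnn_feat, w_gcn, w_cnn, w_gate)
print(fused)
```

Because the gate lies strictly between 0 and 1, each fused channel is a weighted average of the two aligned branches, so neither branch can dominate purely because of its raw magnitude. With an all-zero gate matrix the gate equals 0.5 everywhere and the sketch reduces to a simple average of the aligned features.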
Wen et al. (Thu,) studied this question.