Human action recognition is a key task in computer vision, with widespread applications in virtual reality, intelligent surveillance, and human-computer interaction. Although deep learning methods have made significant progress on this task, existing methods still struggle with multimodal data fusion, lack robustness in complex environments, and lose accuracy when data is missing or modalities are incomplete. To address these challenges, this paper proposes a human action recognition model based on the Multimodal Adaptive Graph Convolutional Network (MAGNet). The core innovation is the integration of adaptive graph convolutions with cross-modal self-attention to enhance multimodal data fusion. By dynamically adjusting the contribution of each modality, the method copes with incomplete data and improves robustness under real-world conditions such as missing or noisy modalities. In addition, a VQ-VAE generative model provides an efficient way to handle missing data and to generate anatomically consistent pose features, which sets this approach apart from existing methods. Experimental results show that MAGNet achieves state-of-the-art performance on both the NTU RGB+D and UTD-MHAD datasets. On NTU RGB+D, the model achieves 95.2% and 98.8% accuracy under the XSub and XView protocols, respectively, significantly outperforming existing baseline methods. MAGNet also demonstrates strong robustness in multimodal data fusion and complex scene adaptation, effectively handling challenges such as occlusion and lighting variation.
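To make the idea of dynamically adjusting each modality's contribution concrete, the following is a minimal sketch and not MAGNet's actual fusion module. It assumes each modality has already been encoded into a fixed-length feature vector and assigned a scalar relevance score (in the paper this role is played by cross-modal self-attention; here the scores are simply given). The function name `fuse_modalities` and its arguments are hypothetical. Missing modalities are masked out before a softmax, so the remaining ones are renormalized to sum to one.

```python
import math


def fuse_modalities(features, scores, present):
    """Weighted fusion of per-modality feature vectors (illustrative only).

    features: list of equal-length float lists, one per modality
              (entries for missing modalities are ignored and may be None)
    scores:   list of scalar relevance scores, one per modality (assumed given)
    present:  list of bools, False where a modality is missing
    Returns (fused_vector, weights).
    """
    if not any(present):
        raise ValueError("at least one modality must be present")

    # Softmax over the available modalities only; subtracting the largest
    # available score keeps the exponentials numerically stable.
    top = max(s for s, p in zip(scores, present) if p)
    exps = [math.exp(s - top) if p else 0.0 for s, p in zip(scores, present)]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the available feature vectors.
    dim = len(next(f for f, p in zip(features, present) if p))
    fused = [0.0] * dim
    for f, w, p in zip(features, weights, present):
        if p:
            for i in range(dim):
                fused[i] += w * f[i]
    return fused, weights


# Hypothetical example: skeleton and RGB features are available, depth is missing.
fused, weights = fuse_modalities(
    features=[[1.0, 1.0], [3.0, 3.0], None],
    scores=[0.0, 0.0, 5.0],
    present=[True, True, False],
)
```

Because the mask is applied before normalization, a missing modality receives zero weight regardless of its score, and the available modalities share the full weight between them. In this sketch a missing modality is simply dropped; the paper instead uses the VQ-VAE to generate features for missing data.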
Dong et al. (Thu,) studied this question.