What question did this study set out to answer?

The aim is to enhance human action recognition by addressing challenges in multimodal data fusion and robustness with a novel model.

March 28, 2026 · Open Access

MAGNet: enhancing action recognition with multimodal fusion and adaptive graph convolution

Key Points

Developed a Multimodal Adaptive Graph Convolutional Network (MAGNet) for action recognition.
Integrated adaptive graph convolutions with cross-modal self-attention mechanisms.
Used a VQ-VAE generative model to handle missing data and produce anatomically consistent pose features.
MAGNet achieved 95.2% accuracy on the XSub protocol and 98.8% on the XView protocol of the NTU RGB+D dataset.
The model outperformed existing baseline methods on both NTU RGB+D and UTD-MHAD.
Demonstrated effective handling of challenges like occlusion and lighting variations.
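
The summary does not say how MAGNet wires VQ-VAE into its pipeline, so the following is only a minimal sketch of the generic vector-quantization step that VQ-VAE models are built around: a continuous feature vector is replaced by its nearest entry in a codebook. The function name `quantize` and the toy codebook are illustrative assumptions, not taken from the paper.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`.

    Illustrative only: in a trained VQ-VAE the codebook is learned and the
    vector would be an encoder output; here both are hand-written toy values.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Pick the index of the codebook entry with the smallest squared distance.
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(vector, codebook[k]))
    return index, codebook[index]


# Toy usage: a 2-D feature snapped to the nearest of three codebook entries.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
nearest_index, nearest_entry = quantize([0.9, 1.2], toy_codebook)
```

Because every output is drawn from a fixed set of learned entries, a quantized representation of this kind can only express patterns seen in training, which is one plausible reading of how the model keeps generated pose features consistent.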

Abstract

Human action recognition is a key task in computer vision, with widespread applications in virtual reality, intelligent surveillance, and human-computer interaction. Although deep learning methods have made significant progress on this task, existing methods still struggle with multimodal data fusion, lack robustness in complex environments, and lose accuracy when data are missing or modalities are incomplete. To address these challenges, this paper proposes a human action recognition model based on the Multimodal Adaptive Graph Convolutional Network (MAGNet). The core innovation is the integration of adaptive graph convolutions with cross-modal self-attention mechanisms to improve multimodal data fusion. By dynamically adjusting the contribution of each modality, the method copes with incomplete data and improves robustness under real-world conditions such as missing or noisy modalities. In addition, a VQ-VAE generative model provides an efficient way to handle missing data and generate anatomically consistent pose features, which sets this approach apart from existing methods. Experimental results show that MAGNet achieves state-of-the-art performance on both the NTU RGB+D and UTD-MHAD datasets. On NTU RGB+D, the model achieves 95.2% and 98.8% accuracy on the XSub and XView protocols, respectively, significantly outperforming existing baseline methods. MAGNet also demonstrates strong robustness in multimodal data fusion and complex scene adaptation, effectively handling challenges such as occlusion and lighting variation.
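
To make "dynamically adjusting the contribution of each modality" concrete, here is a minimal sketch of one common way to do it: a softmax over per-modality relevance scores, restricted to the modalities that are actually present. This is not the paper's architecture; the function `fuse_modalities`, the modality names, and the hand-set scores are assumptions for illustration, and in a real model the scores would come from a learned attention layer.

```python
import math


def fuse_modalities(features, scores):
    """Fuse per-modality feature vectors using softmax weights.

    features: dict mapping modality name -> list of floats, or None when
              that modality is missing for the current sample.
    scores:   dict mapping modality name -> float relevance score
              (hypothetical stand-in for a learned attention score).
    Returns (fused_vector, weights).
    """
    available = [name for name, vec in features.items() if vec is not None]
    if not available:
        raise ValueError("at least one modality must be present")

    # Softmax over available modalities only: a missing modality receives
    # no weight, and the remaining weights still sum to 1.
    max_score = max(scores[name] for name in available)
    exp_scores = {name: math.exp(scores[name] - max_score) for name in available}
    total = sum(exp_scores.values())
    weights = {name: exp_scores[name] / total for name in available}

    # Weighted sum of the available feature vectors.
    dim = len(features[available[0]])
    fused = [0.0] * dim
    for name in available:
        for i, value in enumerate(features[name]):
            fused[i] += weights[name] * value
    return fused, weights


# Toy usage: the depth stream is missing, so only skeleton and RGB contribute.
toy_features = {"skeleton": [1.0, 0.0], "rgb": [0.0, 1.0], "depth": None}
toy_scores = {"skeleton": 0.0, "rgb": 0.0, "depth": 0.0}
fused, weights = fuse_modalities(toy_features, toy_scores)
```

Renormalizing over the present modalities means a dropped sensor shifts weight to the remaining streams rather than injecting zeros into the fused feature.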
