Accurate 3D human pose estimation is valuable in fields such as human–computer interaction, motion analysis, and medical rehabilitation. Traditional single-modal methods have significant limitations in complex environments. This paper proposes a dynamic multi-modal human pose estimation method that fuses visual sensors and millimeter-wave radar. First, we construct a radar point cloud processing framework based on graph neural networks. This framework preserves spatial topological relationships through a k-nearest neighbor graph structure and fuses five-dimensional feature information using a reflection intensity-weighted message passing mechanism. Second, we design a dynamic fusion strategy that combines basic quality assessment, learnable quality assessment, and modal prior weights to achieve quality-aware adaptive fusion. Systematic experiments on two datasets demonstrate the effectiveness of our approach. On the standard-environment mRI dataset, our method achieves an MPJPE of 91.82 ± 41.81 mm. On the complex-environment mmBody dataset, the average MPJPE is 62.47 ± 22.39 mm. Statistical analysis indicates that all improvements are significant (p < 0.001). The method is robust in complex environments.

• Combines visual sensors and radar to enhance robustness in 3D pose estimation.
• Uses graph neural networks for multi-scale spatial and dynamic feature extraction.
• Implements dynamic fusion with learnable weights for stable performance under challenging conditions.
• Overcomes limitations of existing methods with improved radar-visual pose estimation.
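
The radar branch described above can be illustrated with a minimal NumPy sketch of a k-nearest neighbor graph followed by one round of reflection intensity-weighted aggregation. This is not the authors' network. The five-dimensional layout (x, y, z, Doppler velocity, intensity), the function names, and the use of a normalized weighted mean as the aggregation rule are all assumptions made for illustration, and intensities are assumed to be non-negative.

```python
import numpy as np

def knn_indices(xyz, k):
    """Return the indices of the k nearest neighbours of each point (self excluded)."""
    # Pairwise squared Euclidean distances, shape (N, N).
    diff = xyz[:, None, :] - xyz[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbour
    return np.argsort(dist2, axis=1)[:, :k]

def intensity_weighted_aggregate(features, intensity, neighbours):
    """Average neighbour features, weighting each neighbour by its intensity."""
    w = intensity[neighbours]                      # (N, k) neighbour intensities
    w = w / (w.sum(axis=1, keepdims=True) + 1e-8)  # normalise weights per node
    return np.einsum("nk,nkd->nd", w, features[neighbours])

# Hypothetical radar points: columns are x, y, z, Doppler velocity, intensity.
points = np.array([
    [0.0, 0.0, 0.0, 0.1, 1.0],
    [1.0, 0.0, 0.0, 0.2, 3.0],
    [0.0, 1.0, 0.0, 0.3, 1.0],
    [5.0, 5.0, 5.0, 0.4, 2.0],
])
nbrs = knn_indices(points[:, :3], k=2)
out = intensity_weighted_aggregate(points, points[:, 4], nbrs)
```

In this toy input, the first point's two neighbours have intensities 3 and 1, so the stronger reflection contributes three quarters of the aggregated feature. A learned message function would normally replace the plain weighted mean.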
Hu et al. studied this question.
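
The quality-aware fusion named in the abstract combines three per-modality signals: a basic quality score, a learnable quality score, and a modal prior weight. The abstract does not state how they are combined, so the sketch below assumes the simplest option, a product of the three scores normalized across modalities. The function names and all numeric values are hypothetical, and the learnable score, which would come from a trained network, is replaced by a constant placeholder.

```python
import numpy as np

def fusion_weights(basic_q, learned_q, prior):
    """Combine three per-modality scores into fusion weights that sum to ~1.

    Assumption: the scores are multiplied and then normalised; the paper's
    actual combination rule may differ.
    """
    score = np.asarray(basic_q) * np.asarray(learned_q) * np.asarray(prior)
    return score / (score.sum() + 1e-8)

def fuse(features, weights):
    """Weighted sum of per-modality feature vectors: (M, D) -> (D,)."""
    return np.tensordot(weights, features, axes=1)

# Modalities in order: [camera, radar]. Values are illustrative only.
basic_q = [0.2, 0.9]    # e.g. the camera is degraded by poor lighting
learned_q = [0.5, 0.5]  # placeholder for the output of a learned assessor
prior = [0.6, 0.4]      # fixed prior preference per modality

weights = fusion_weights(basic_q, learned_q, prior)
features = np.array([[1.0, 0.0],   # camera feature vector
                     [0.0, 1.0]])  # radar feature vector
fused = fuse(features, weights)
```

With these inputs the radar receives roughly three quarters of the weight despite its lower prior, which is the behaviour a quality-aware scheme is meant to produce when one sensor degrades.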