Monocular 3D human pose estimation remains a fundamental challenge in computer vision, primarily because 2D-to-3D lifting is inherently ill-posed and depth ambiguity persists. These difficulties are compounded by the inadequate exploitation of anatomical topology in skeletal inputs and by the limited capacity of existing architectures to capture multi-scale motion dynamics. To address these limitations, we propose SCARNet, a unified framework that integrates spatial–channel collaborative attention with adaptive multi-scale receptive-field enhancement. The architecture learns hierarchical, structure-aware representations. Specifically, a Spatial–Channel Attention Fusion mechanism jointly encodes graph-based skeletal topology, non-local inter-joint interactions, and global spatial context. This design captures long-range inter-limb dependencies while maintaining anatomical consistency. Furthermore, an Adaptive Multi-scale Receptive-field Enhancement strategy reconciles fine-grained joint articulations with coarse-grained body trajectories through dynamic scale calibration, thereby alleviating feature misalignment induced by motion heterogeneity. Extensive evaluations on the Human3.6M and MPI-INF-3DHP benchmarks demonstrate the robustness of our approach. SCARNet achieves a Mean Per-Joint Position Error (MPJPE) of 47.3 mm on Human3.6M under Protocol #1 and a Percentage of Correct Keypoints (PCK) of 86.5% on MPI-INF-3DHP in a zero-shot setting, establishing a new state of the art on both benchmarks. Code and models are available at https://github.com/korama-ezon/SCARNet.
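The two metrics reported above can be illustrated with a minimal sketch. It is not taken from the SCARNet repository. It assumes the common conventions that MPJPE is the mean Euclidean joint error after aligning the root joint, and that PCK counts joints whose error falls within a threshold, with 150 mm used here as an assumed default. The function names, the root index, and the toy poses are hypothetical.

```python
import math


def root_align(pose, root=0):
    """Translate a pose so that the (assumed) root joint sits at the origin."""
    rx, ry, rz = pose[root]
    return [(x - rx, y - ry, z - rz) for x, y, z in pose]


def joint_errors(pred, gt, root=0):
    """Per-joint Euclidean distances after root alignment (input units, e.g. mm)."""
    aligned_pred = root_align(pred, root)
    aligned_gt = root_align(gt, root)
    return [math.dist(p, g) for p, g in zip(aligned_pred, aligned_gt)]


def mpjpe(pred, gt, root=0):
    """Mean Per-Joint Position Error: average of the per-joint distances."""
    errors = joint_errors(pred, gt, root)
    return sum(errors) / len(errors)


def pck(pred, gt, threshold=150.0, root=0):
    """Percentage of joints whose error is within `threshold` (assumed 150 mm)."""
    errors = joint_errors(pred, gt, root)
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)


# Toy 3-joint poses in millimetres (illustrative values only).
gt_pose = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
pred_pose = [(10.0, 0.0, 0.0), (110.0, 30.0, 40.0), (10.0, 200.0, 300.0)]

toy_mpjpe = mpjpe(pred_pose, gt_pose)  # per-joint errors are 0, 50, and 300 mm
toy_pck = pck(pred_pose, gt_pose)      # two of three joints fall within 150 mm
```

In practice both quantities are averaged over all frames of a test set. The sketch scores a single pose to keep the arithmetic easy to check by hand.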
Yi et al. (Mon,) studied this question.