Most existing research in human pose estimation focuses on predicting joint positions and pays limited attention to recovering the full 6D human pose, which comprises both 3D joint positions and bone orientations. Position-only methods treat joints as independent points, which often results in structurally implausible poses and increased sensitivity to depth ambiguities, that is, cases where poses share nearly identical joint positions but differ significantly in limb orientations. Incorporating bone orientation information helps enforce geometric consistency and yields more anatomically plausible skeletal structures. Additionally, many state-of-the-art methods rely on large, computationally expensive models, limiting their applicability in real-time scenarios such as human–robot collaboration. In this work, we propose STAG-Net, a novel 2D-to-6D lifting network that integrates Graph Convolutional Networks (GCNs), attention mechanisms, and Temporal Convolutional Networks (TCNs). By jointly learning joint positions and bone orientations, STAG-Net promotes geometrically consistent skeletal structures while remaining lightweight and computationally efficient. On the Human3.6M benchmark, STAG-Net achieves a mean per-joint position error (MPJPE) of 41.8 mm using 243 input frames. In addition, we introduce a lightweight single-frame variant, STG-Net, which achieves an MPJPE of 50.8 mm while operating in real time at 60 FPS using a single RGB camera. Extensive experiments on multiple large-scale datasets demonstrate the effectiveness and efficiency of the proposed approach.
Yang et al. (Wed,) studied this question.
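
For readers unfamiliar with the metric quoted above, MPJPE is commonly computed as the average Euclidean distance between predicted and ground-truth 3D joint positions. The following is a minimal sketch of that general definition, not the authors' evaluation code: the function name `mpjpe`, the toy three-joint pose, and its coordinate values are illustrative assumptions, and any alignment step (such as root-joint centering) that a benchmark protocol may apply beforehand is omitted because the abstract does not specify it.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error between two poses.

    pred, target: sequences of (x, y, z) joint coordinates given in the
    same units (e.g. millimetres) and in the same joint order.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Euclidean distance for each joint, averaged over all joints.
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


# Toy three-joint pose with hypothetical values in millimetres.
target = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (3.0, 104.0, 0.0), (0.0, 200.0, 12.0)]

# Per-joint errors are 0, 5, and 12 mm, so the mean is 17/3 mm.
error_mm = mpjpe(pred, target)
```

Because this measure depends only on joint positions, two poses whose limbs are rotated differently about their own bone axes can receive the same score, which is the kind of ambiguity that orientation-aware (6D) estimation is meant to address.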