What does this research mean for the field?

STAG-Net achieves a mean per joint position error (MPJPE) of 41.8 mm on the Human3.6M benchmark, using 243 input frames, for 6D human pose estimation. It promotes geometrically consistent skeletal structures while remaining lightweight and computationally efficient. Novelty: novel finding. Consensus alignment: neutral.

What question did this study set out to answer?

The study aims to improve 6D human pose estimation by integrating joint positions and bone orientations for anatomically plausible poses.

March 6, 2026 | Open Access

STAG-Net: A Lightweight Spatial–Temporal Attention GCN for Real-Time 6D Human Pose Estimation in Human–Robot Collaboration Scenarios

Key Points

The study aims to improve 6D human pose estimation by integrating joint positions and bone orientations for anatomically plausible poses.
Developed STAG-Net, a 2D-to-6D lifting network that combines Graph Convolutional Networks (GCNs), attention mechanisms, and Temporal Convolutional Networks (TCNs).
Validated the approach and assessed its performance on the Human3.6M dataset.
Proposed a lightweight single-frame variant, STG-Net, for real-time applications.
Achieved an MPJPE of 41.8 mm using 243 input frames on the Human3.6M benchmark.
STG-Net operates in real time at 60 FPS with an MPJPE of 50.8 mm using a single RGB camera.
Demonstrated effectiveness and computational efficiency across multiple large-scale datasets.
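
The MPJPE figures quoted above are averages of per-joint Euclidean distances between predicted and ground-truth 3D joint positions. The following is a minimal sketch of that metric, not the authors' evaluation code: the function name `mpjpe` and the toy three-joint skeleton are assumptions made for illustration, and the root-joint alignment that benchmark protocols commonly apply before measuring is omitted.

```python
import math

def mpjpe(pred, target):
    """Mean per joint position error: the average Euclidean distance between
    predicted and ground-truth 3D joints, in the units of the inputs."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

# Toy three-joint skeleton in millimetres (hypothetical values, not from the paper).
target = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 10.0)]

# Per-joint errors are 5 mm, 0 mm and 10 mm, so the mean is 5 mm.
error_mm = mpjpe(pred, target)
print(f"MPJPE: {error_mm:.1f} mm")
```

Because the metric only compares positions, it says nothing about bone orientation, which is the gap the 6D formulation targets.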

Abstract

Most existing research in human pose estimation focuses on predicting joint positions, paying limited attention to recovering the full 6D human pose, which comprises both 3D joint positions and bone orientations. Position-only methods treat joints as independent points, often resulting in structurally implausible poses and increased sensitivity to depth ambiguities—cases where poses share nearly identical joint positions but differ significantly in limb orientations. Incorporating bone orientation information helps enforce geometric consistency, yielding more anatomically plausible skeletal structures. Additionally, many state-of-the-art methods rely on large, computationally expensive models, which limit their applicability in real-time scenarios, such as human–robot collaboration. In this work, we propose STAG-Net, a novel 2D-to-6D lifting network that integrates Graph Convolutional Networks (GCNs), attention mechanisms, and Temporal Convolutional Networks (TCNs). By simultaneously learning joint positions and bone orientations, STAG-Net promotes geometrically consistent skeletal structures while remaining lightweight and computationally efficient. On the Human3.6M benchmark, STAG-Net achieves an MPJPE of 41.8 mm using 243 input frames. In addition, we introduce a lightweight single-frame variant, STG-Net, which achieves 50.8 mm MPJPE while operating in real time at 60 FPS using a single RGB camera. Extensive experiments on multiple large-scale datasets demonstrate the effectiveness and efficiency of the proposed approach.
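
One way to see why joint positions alone under-determine a 6D pose is to twist a bone about its own axis: both end joints stay where they are, yet the limb's orientation changes. The sketch below illustrates this with Rodrigues' rotation formula. It is an illustration under assumed values (a hypothetical 250 mm forearm and a hypothetical `palm_normal` reference direction) and does not reflect how STAG-Net represents bone orientations.

```python
import math

def rotate_about_axis(v, axis, angle):
    """Rodrigues' formula: rotate vector v about a unit-length axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    dot = sum(a * b for a, b in zip(axis, v))
    cross = (
        axis[1] * v[2] - axis[2] * v[1],
        axis[2] * v[0] - axis[0] * v[2],
        axis[0] * v[1] - axis[1] * v[0],
    )
    return tuple(v[i] * c + cross[i] * s + axis[i] * dot * (1 - c) for i in range(3))

# Hypothetical forearm: elbow at the origin, wrist 250 mm along the x axis.
elbow = (0.0, 0.0, 0.0)
wrist = (250.0, 0.0, 0.0)
length = math.dist(elbow, wrist)
bone_axis = tuple((w - e) / length for w, e in zip(wrist, elbow))

# A reference direction perpendicular to the bone (assumed, for illustration).
palm_normal = (0.0, 0.0, 1.0)

# Twist the limb by 90 degrees about its own axis.
twisted_axis = rotate_about_axis(bone_axis, bone_axis, math.pi / 2)
twisted_normal = rotate_about_axis(palm_normal, bone_axis, math.pi / 2)

# The bone direction is unchanged, so elbow and wrist positions are identical,
# but the perpendicular reference direction has moved from +z to -y.
print("bone axis after twist:  ", twisted_axis)
print("palm normal after twist:", twisted_normal)
```

A position-only metric would score the twisted and untwisted poses identically, whereas an orientation term can tell them apart.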
