In human pose estimation, formulating keypoint localization as a classification task over discretized coordinate grids has proven effective. However, this formulation reduces the 2D features of the keypoints to 1D coordinate representations, which discards the spatial constraints among keypoints and makes it harder for the model to capture their structural relationships. To address this issue, we propose an enhanced query attention mechanism constrained by bidirectional graphs. The core idea is to impose topological constraints on the 1D coordinate representations. First, two fundamental connection directions of the skeleton are defined and encoded as a pair of adjacency matrices to enhance the feature interaction capability of the graph convolutional network (GCN). Second, a GCN-guided multi-scale feature fusion framework is designed to combine multi-scale visual features with structural priors, thereby enhancing the representation of keypoint spatial distributions. Finally, a dual-gate module is incorporated into a GCN-guided attention unit to construct a structured query matrix constrained by the bidirectional skeleton graphs, which helps filter out spurious joint interactions and emphasize plausible ones. Extensive experiments on the Tai Chi Chuan-Pose, Animal-Pose, AP-10K, MPII, COCO, and COCO-WholeBody datasets demonstrate that the proposed method outperforms existing methods in both accuracy and robustness, particularly in balancing precise local keypoint localization with global pose consistency.
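To make the first step concrete, the following is a minimal sketch of how a skeleton could be encoded as a pair of directed adjacency matrices and used in one message-passing step. It is not the implementation described here: the two directions are assumed to be root-to-leaf and leaf-to-root, the 5-joint skeleton is hypothetical, the names `build_adjacency`, `propagate`, and `bidirectional_step` are illustrative, and the learned weight matrices of a real GCN layer are omitted for brevity.

```python
# Hypothetical 5-joint skeleton given as (parent, child) pairs; illustrative only.
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4)]
NUM_JOINTS = 5


def build_adjacency(edges, num_joints, reverse=False):
    """Row-normalized directed adjacency matrix with self-loops.

    reverse=False: each joint aggregates from its parent (assumed root-to-leaf flow).
    reverse=True:  each joint aggregates from its children (assumed leaf-to-root flow).
    """
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for parent, child in edges:
        receiver, sender = (parent, child) if reverse else (child, parent)
        adj[receiver][sender] = 1.0
    # Divide each row by its sum so every joint averages over its neighbours.
    return [[value / sum(row) for value in row] for row in adj]


def propagate(adj, features):
    """One aggregation step: row i of the result is sum_k adj[i][k] * features[k]."""
    dim = len(features[0])
    return [[sum(adj[i][k] * features[k][d] for k in range(len(features)))
             for d in range(dim)]
            for i in range(len(adj))]


def bidirectional_step(features, a_fwd, a_bwd):
    """Sum the two directional aggregations, then apply ReLU.

    A real layer would multiply each branch by its own learned weight matrix;
    that is left out here to keep the sketch dependency-free.
    """
    fwd = propagate(a_fwd, features)
    bwd = propagate(a_bwd, features)
    return [[max(f + b, 0.0) for f, b in zip(row_f, row_b)]
            for row_f, row_b in zip(fwd, bwd)]


A_FWD = build_adjacency(EDGES, NUM_JOINTS)
A_BWD = build_adjacency(EDGES, NUM_JOINTS, reverse=True)

# One-hot joint features make the propagation pattern easy to inspect.
FEATURES = [[1.0 if i == j else 0.0 for j in range(NUM_JOINTS)]
            for i in range(NUM_JOINTS)]
OUT = bidirectional_step(FEATURES, A_FWD, A_BWD)
```

Keeping the two directions in separate matrices, rather than merging them into one symmetric adjacency, lets each direction carry its own parameters in a full model, which is one plausible reading of how the pair of matrices enhances feature interaction.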