What question did this study set out to answer?

The study asks whether controlling how structural information propagates during occluded-joint recovery can improve human pose estimation in crowded and occluded scenes.

June 10, 2026 | Open Access

CSPA-Net: Controlled Structural Propagation and Cross-Axis Attention Network for Human Pose Estimation

Key Points

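Developed CSPA-Net, a heatmap-based pose estimation framework that controls structural propagation during occluded-joint recovery.
Estimated joint reliability from coarse heatmaps using the dispersion and spatial spread of each response distribution.
Used a Skeleton-consistent Manhattan Constraint as a target-oriented spatial prior that limits propagation to regions consistent with the estimated target skeleton.
Implemented a Pose-Structured Cross-Axis Attention module that exchanges row-wise and column-wise contextual information.
Achieved 75.3% AP and 80.9% AR on COCO val2017 and 69.6% AP on the CrowdPose test set.
Outperformed the HRNet-W32 baseline under the same input setting.
Results suggest that controlled structural propagation helps in occluded and crowded scenes.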

Abstract

Human pose estimation in crowded images remains difficult because the visual evidence around many joints is incomplete, and responses from nearby persons may be mistakenly incorporated into the target skeleton. To address this issue, this paper presents CSPA-Net, a heatmap-based pose estimation framework that controls the propagation of structural information during occluded-joint recovery. The proposed network first estimates joint reliability from coarse heatmaps by considering both the dispersion and the spatial spread of the response distribution. Based on these soft joint locations and uncertainty cues, a Skeleton-consistent Manhattan Constraint is constructed to define a target-oriented spatial prior. This prior limits structural propagation to regions that are more consistent with the estimated target skeleton, reducing the chance of introducing features from adjacent instances. In addition, a Pose-Structured Cross-Axis Attention module is designed to exchange row-wise and column-wise contextual information so that lateral body symmetry and vertical kinematic dependencies can be modeled in a more directed manner. Finally, multiscale adaptive aggregation combines coarse structural cues with fine local details for heatmap prediction. Experiments show that CSPA-Net achieves 75.3% AP and 80.9% AR on COCO val2017 and 69.6% AP on the CrowdPose test set, outperforming the HRNet-W32 baseline under the same input setting. These results suggest that controlled structural propagation is useful for improving pose estimation in occluded and crowded scenes.
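
The first two steps described above (soft joint locations with uncertainty cues, then a Manhattan-distance spatial prior) can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names `soft_joint_stats` and `manhattan_prior`, the softmax normalization, the use of normalized entropy for dispersion and standard deviation for spatial spread, the exponential decay, and the `base_radius` parameter. The sketch also places the prior around joints only, whereas the paper's constraint is skeleton-consistent and may follow limbs as well.

```python
import math


def soft_joint_stats(heatmap, temperature=1.0):
    """Return (x, y, spread, entropy) for one coarse joint heatmap.

    heatmap is a list of rows of raw scores. The softmax, the spread
    measure, and the normalized entropy are illustrative assumptions,
    not the paper's stated definitions.
    """
    h, w = len(heatmap), len(heatmap[0])
    flat = [v / temperature for row in heatmap for v in row]
    peak = max(flat)
    exps = [math.exp(v - peak) for v in flat]  # shift by max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Soft joint location: expectation of pixel coordinates.
    x = sum(p * (i % w) for i, p in enumerate(probs))
    y = sum(p * (i // w) for i, p in enumerate(probs))

    # Spatial spread: standard deviation of distance around the soft location.
    var = sum(
        p * (((i % w) - x) ** 2 + ((i // w) - y) ** 2)
        for i, p in enumerate(probs)
    )
    spread = math.sqrt(var)

    # Dispersion: entropy normalized to [0, 1] by the uniform-case maximum.
    entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(h * w)
    return x, y, spread, entropy


def manhattan_prior(shape, joints, base_radius=2.0):
    """Build a soft mask in [0, 1] that decays with L1 distance to any joint.

    joints is a list of (x, y, spread) tuples. Less reliable joints
    (larger spread) get a wider radius. Hypothetical design choice.
    """
    h, w = shape
    prior = [[0.0] * w for _ in range(h)]
    for jx, jy, spread in joints:
        radius = base_radius + spread
        for r in range(h):
            for c in range(w):
                dist = abs(c - jx) + abs(r - jy)  # Manhattan distance
                prior[r][c] = max(prior[r][c], math.exp(-dist / radius))
    return prior


# Usage: a sharp response yields a confident joint and a tight prior.
sharp = [[0.0] * 8 for _ in range(6)]
sharp[2][3] = 20.0
x, y, spread, entropy = soft_joint_stats(sharp)
mask = manhattan_prior((6, 8), [(x, y, spread)])
```

In this sketch, features would be multiplied by `mask` before structural propagation, so responses far from the estimated target skeleton contribute less. That is one plausible reading of how such a prior could reduce leakage from adjacent persons, not a description of the authors' implementation.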
