Human Pose Estimation (HPE) has attracted significant attention in computer vision (CV) because of its diverse applications. Deep convolutional neural networks (CNNs) are a common approach to the task, but they still face several critical issues. Many existing models apply convolutions in series with pooling, which produces low-resolution outputs that are suboptimal for the precise localisation HPE requires. They also tend to prioritise local feature learning and overlook crucial contextual relationships between keypoints. This work addresses these challenges by proposing a novel approach for enhancing HPE. Firstly, the paper evaluates the high-resolution network (HRNet) and its advantages over other CNN architectures. Secondly, it introduces a dual self-attention (DSA) mechanism designed to enhance the model’s global awareness, thereby enriching feature maps with contextual information. Thirdly, it integrates the DSA mechanism into HRNet to form DSA-HRNet. Evaluated on the COCO 2017 validation set (val2017), the model shows improvements of 2.3% in mean average precision (mAP), 3% in AP at 50 (AP50), and 2.7% in AP at 75 (AP75). Finally, a series of experiments investigates the effectiveness of the DSA mechanism within the HRNet framework, indicating that the approach offers a streamlined and effective solution for improving HPE.
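The abstract does not spell out the internals of the DSA mechanism, so the following is only an illustrative sketch of how a dual self-attention block might enrich a feature map with global context. It assumes one common design: a position branch in which every spatial location attends to all others, and a channel branch in which every channel attends to all others, with both results added back to the input. The function names, the absence of learned projections, and the plain-list representation (N positions by C channels) are simplifications introduced here, not details taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def position_attention(feat):
    """Each spatial position aggregates features from all positions.

    feat is an N x C matrix: N flattened spatial positions, C channels.
    Assumption: no learned query/key/value projections are applied.
    """
    scores = matmul(feat, transpose(feat))       # N x N position affinities
    weights = [softmax(row) for row in scores]   # each row sums to 1
    return matmul(weights, feat)                 # N x C


def channel_attention(feat):
    """Each channel aggregates responses from all channels."""
    scores = matmul(transpose(feat), feat)       # C x C channel affinities
    weights = [softmax(row) for row in scores]   # each row sums to 1
    return matmul(feat, transpose(weights))      # N x C


def dual_self_attention(feat):
    """Sum both attention branches and add them to the input (residual)."""
    pos = position_attention(feat)
    chn = channel_attention(feat)
    return [
        [x + p + c for x, p, c in zip(f_row, p_row, c_row)]
        for f_row, p_row, c_row in zip(feat, pos, chn)
    ]


# Toy feature map: 4 spatial positions, 3 channels (values are arbitrary).
toy = [
    [0.1, 0.5, 0.2],
    [0.4, 0.1, 0.3],
    [0.2, 0.2, 0.9],
    [0.7, 0.3, 0.1],
]
enriched = dual_self_attention(toy)
```

The output has the same N x C shape as the input, which is what would let a block of this kind be inserted into a high-resolution branch without changing the surrounding architecture. A practical version would use a tensor library and learned projections; this sketch only shows the data flow.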
Kumaresan et al. studied this question.