Human pose estimation underlies many visual intelligence systems and remains a complex, challenging task. Occlusion, viewpoint changes, and motion blur inevitably reduce the visibility of human keypoints, and traditional methods often struggle to handle these interfering factors. In particular, effectively exploiting spatiotemporal context and supervising keypoint prediction in video data remains a key challenge. Our research addresses the detection of keypoints with different levels of visibility in complex scenes. First, we adopt a strong backbone network, which is effective for highly visible joints. Second, we design a keypoint-aware spatiotemporal encoder and a dynamic region-sensitive encoder that collect feature-level dynamic variations from the temporal context, compensating for the missing feature information of low-visibility joints in the target frame. Finally, we introduce completely invisible joints into the training phase and propose a loss function based on the Pearson correlation coefficient, which trains keypoints on both positive and negative samples through global constraints. With these components, our method achieves accurate detection results in a variety of challenging scenarios. Results from multiple experiments show that the proposed framework delivers strong human pose estimation performance.
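To make the idea of a correlation-based loss concrete, the following is a minimal sketch of a generic Pearson-correlation loss between a predicted and a target heatmap. It is not the authors' formulation: the function name `pearson_loss`, the `1 - r` form, the `eps` stabilizer, and the use of flattened heatmaps are all assumptions made for illustration. The abstract does not specify how completely invisible joints (negative samples) enter the loss, so that part is not modeled here.

```python
import math


def pearson_loss(pred, target, eps=1e-8):
    """Return 1 - r for two flattened heatmaps (hypothetical helper).

    r is the Pearson correlation coefficient, so the loss is close to 0
    when the heatmaps are perfectly correlated and close to 2 when they
    are perfectly anti-correlated. This is an illustrative assumption,
    not the loss defined in the paper.
    """
    if len(pred) != len(target):
        raise ValueError("heatmaps must have the same number of elements")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # Covariance and variances, all left unnormalized (the 1/n factors cancel).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    # eps avoids division by zero when either heatmap is constant.
    r = cov / (math.sqrt(var_p * var_t) + eps)
    return 1.0 - r


# Toy 2x2 heatmaps, flattened row by row.
target = [0.0, 0.0, 1.0, 0.0]
scaled_pred = [0.0, 0.0, 2.0, 0.0]    # same peak location, different scale
inverted_pred = [1.0, 1.0, 0.0, 1.0]  # peak everywhere except the target
flat_pred = [0.5, 0.5, 0.5, 0.5]      # constant, carries no location signal
```

Because the correlation depends on the whole heatmap rather than on individual pixels, a loss of this kind is invariant to the scale and offset of the prediction, which is one plausible reading of "global constraints" in the abstract.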
Xu et al. (Thu,) studied this question.