What question did this study set out to answer?

This review aims to explore the evolution of pose estimation methods and architectures, highlighting key advancements.

June 6, 2026 · Open Access

Deep Human Pose Estimation: A Conceptual Review of Paradigms, Progress, and Frontiers

Key Points

Analyzes recent advances in pose estimation across five structural pillars: Representation, Architecture, Ambiguity and Generalization, Context Extension, and Applications.
Moves beyond traditional classifications, such as 2D versus 3D or top-down versus bottom-up, toward a conceptual framework for understanding progress in the field.
Traces the evolution of representations from pixel coordinate regression to heatmaps and probabilistic representations.
Follows the evolution of architectures from multi-stage CNNs to transformers and state space models (SSMs).
Examines how self-supervised, uncertainty-aware, and diffusion models address 3D depth ambiguity, occlusion, and domain gaps.

Abstract

Pose estimation is a major problem in computer vision: it directly transforms an input image into a hierarchical representation of the human skeleton, with applications in virtual/augmented reality and human–machine interaction. Research in this field grew rapidly between 2018 and 2025, and traditional taxonomies such as 2D versus 3D or top-down versus bottom-up are no longer sufficient to capture how its ideas have evolved. To address this gap, we propose a conceptual review of pose estimation that focuses on the intellectual evolution of methods and architectures rather than on the standard flat classification of papers. We divide recent advances into five structural pillars: Representation, which traces the evolution from pixel coordinate regression to heatmaps and probabilistic representations; Architecture, which analyzes the transition from multi-stage CNNs to transformers and state space models (SSMs); Ambiguity and Generalization, which analyzes how self-supervised, uncertainty-aware, and diffusion models address 3D depth ambiguity, occlusion, and domain gaps by modeling multiple plausible poses and reducing dependence on fully supervised in-the-wild 3D labels; Context Extension, which covers temporal dynamics, multi-view fusion, and potential sensors; and Applications, which links algorithms to efficiency, privacy, and foundation models. By examining these pillars in depth, we offer a unified view of the research paradigms that define human pose estimation and help identify future problems and solutions in pose estimation and human-centered tasks.
