Pose estimation is a major problem in computer vision: it directly transforms an input image into a hierarchical representation of the human skeleton, with applications in virtual and augmented reality and in human–machine interaction. Research in this area grew rapidly between 2018 and 2025, and traditional taxonomies such as 2D versus 3D or top-down versus bottom-up are no longer sufficient to capture how its ideas have evolved. To address this gap, we present a conceptual review of pose estimation that focuses on the intellectual evolution of methods and architectures rather than on the standard flat classification of papers. We organize recent advances into five structural pillars: Representation, which traces the evolution from pixel coordinate regression to heatmaps and probabilistic representations; Architecture, which analyzes the transition from multi-stage CNNs to transformers and state space models (SSMs); Ambiguity and Generalization, which analyzes how self-supervised, uncertainty-aware, and diffusion-based models address 3D depth ambiguity, occlusion, and domain gaps by modeling multiple plausible poses and reducing dependence on fully supervised in-the-wild 3D labels; Context Extension, which covers temporal dynamics, multi-view fusion, and potential sensors; and Applications, which links algorithms to efficiency, privacy, and foundation models. By examining these pillars in depth, we offer a unified view of the evolving research paradigms that define human pose estimation and help identify future problems and solutions in pose estimation and human-centered tasks.
Diallo et al. (Thu,) studied this question.
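To make the Representation pillar concrete, the following minimal sketch contrasts two ways of reading a keypoint out of a heatmap: a hard argmax and an expectation over the normalized heatmap (in the spirit of soft-argmax or integral-style decoding). It is an illustration under stated assumptions, not a method taken from the review: the function names, the Gaussian target, and the `sigma` value are all hypothetical, and the heatmap is assumed to be non-negative.

```python
import numpy as np


def gaussian_heatmap(height, width, center_xy, sigma=1.5):
    """Render a 2D Gaussian centred on a keypoint given as (x, y) in pixels.

    Illustrative target only; sigma is an assumed value.
    """
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def decode_argmax(heatmap):
    """Hard decoding: the integer pixel with the highest score, as (x, y)."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(col), float(row)


def decode_expectation(heatmap):
    """Soft decoding: expected (x, y) under the normalized heatmap.

    Assumes the heatmap is non-negative and not identically zero.
    """
    probs = heatmap / heatmap.sum()
    xs = np.arange(heatmap.shape[1])
    ys = np.arange(heatmap.shape[0])
    x = float((probs.sum(axis=0) * xs).sum())  # marginal over rows, then mean of x
    y = float((probs.sum(axis=1) * ys).sum())  # marginal over columns, then mean of y
    return x, y


if __name__ == "__main__":
    # A keypoint placed between pixel centres along x.
    hm = gaussian_heatmap(64, 48, (20.4, 31.0))
    print("argmax     :", decode_argmax(hm))
    print("expectation:", decode_expectation(hm))
```

The argmax readout snaps to the pixel grid, whereas the expectation can return a sub-pixel estimate and is differentiable, which is one reason probabilistic readouts are attractive for end-to-end training.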