What question did this study set out to answer?

The aim is to enhance 3D human pose estimation from 2D images using a novel transformer framework.

May 25, 2026

RUMPL: Ray-Based Transformers for Universal Multi-View 2D to 3D Human Pose Lifting

Key Points

Introduced RUMPL, a transformer-based model using a 3D ray-based representation of 2D keypoints.
Introduced a View Fusion Transformer that uses learned fused-ray tokens to aggregate information along rays across multiple views.
Evaluated the model on Human3.6M and CMU Panoptic datasets under various multi-view conditions.
Achieved a 56.6% reduction in MPJPE (All KP) on Human3.6M compared to triangulation-based methods.
Improved results on the CMU Panoptic dataset by more than 70% compared to transformer-based image-representation approaches.
Confirmed robustness and scalability on new benchmarks, including in-the-wild multi-view and multi-person datasets.
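The headline numbers above are relative reductions in MPJPE (mean per-joint position error). As a minimal sketch of how such a figure is typically computed, assuming the standard definition of MPJPE as the average Euclidean distance between predicted and ground-truth joints: the function names and the toy values below are illustrative and do not come from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (..., J, 3) in the same units (e.g. mm).
    Returns the Euclidean distance averaged over joints and any leading dims.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Toy example (made-up values): 17 joints, each prediction off by 10 units along x.
gt = np.zeros((17, 3))
pred = gt.copy()
pred[:, 0] = 10.0

toy_error = mpjpe(pred, gt)                      # 10.0 for this toy pose
toy_reduction = relative_reduction(50.0, 25.0)   # hypothetical errors, gives 50.0
```

A reported "56.6% reduction" would correspond to `relative_reduction(baseline, rumpl)` evaluating to 56.6 for the two methods' MPJPE values, which are not given in this summary.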

Abstract

Estimating 3D human poses from 2D images remains challenging due to occlusions and projective ambiguity. Multi-view learning-based approaches mitigate these issues but often fail to generalize to real-world scenarios, as large-scale multi-view datasets with 3D ground truth are scarce and captured under constrained conditions. To overcome this limitation, recent methods rely on 2D pose estimation combined with 2D-to-3D pose lifting trained on synthetic data. Building on our previous MPL framework, we propose RUMPL, a transformer-based 3D pose lifter that introduces a 3D ray-based representation of 2D keypoints. This formulation enables a model agnostic to camera parameters that can be universally deployed across arbitrary camera configurations in a given area without retraining or fine-tuning. A new View Fusion Transformer leverages learned fused-ray tokens to aggregate information along rays, further improving multi-view consistency. Evaluation on standard benchmarks shows that RUMPL significantly outperforms existing methods, yielding a 56.6% MPJPE (All KP) reduction on Human3.6M over triangulation-based methods and exceeding 70% improvement on the CMU Panoptic dataset when compared to transformer-based image-representation approaches. Results on new benchmarks, including in-the-wild multi-view and multi-person datasets, confirm its robustness and scalability.
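To make the ray-based representation concrete, the sketch below back-projects 2D keypoints into 3D rays expressed in a shared world frame. It assumes a standard pinhole camera with intrinsics `K` and world-to-camera extrinsics `R`, `t` (so that `x_cam = R @ x_world + t`); the abstract does not specify the exact parameterization RUMPL uses, so the function name `keypoints_to_rays` and the (origin, unit direction) encoding are assumptions made for illustration.

```python
import numpy as np

def keypoints_to_rays(kpts_2d, K, R, t):
    """Back-project pixel keypoints to world-frame rays (assumed pinhole model).

    kpts_2d: (J, 2) pixel coordinates.
    K: (3, 3) intrinsics. R: (3, 3), t: (3,) world-to-camera extrinsics.
    Returns (origins, directions), each of shape (J, 3); directions are unit length.
    """
    kpts_2d = np.asarray(kpts_2d, dtype=float)
    num_joints = kpts_2d.shape[0]
    pixels_h = np.hstack([kpts_2d, np.ones((num_joints, 1))])  # homogeneous pixels
    dirs_cam = pixels_h @ np.linalg.inv(K).T                   # rows are K^-1 @ p
    dirs_world = dirs_cam @ R                                  # rows are R^T @ d_cam
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)
    cam_center = -R.T @ t                                      # camera center in world frame
    origins = np.tile(cam_center, (num_joints, 1))
    return origins, dirs_world

# Usage with a made-up camera: rotate 30 degrees about z, then translate.
angle = np.deg2rad(30.0)
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.1, -0.2, 2.0])
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])

joint_world = np.array([0.3, 0.4, 1.0])   # hypothetical 3D joint
joint_cam = R @ joint_world + t
pixel = (K @ joint_cam)[:2] / joint_cam[2]  # its 2D projection

origins, directions = keypoints_to_rays(pixel[None, :], K, R, t)
to_joint = joint_world - origins[0]
# The true joint lies on the ray: the offset from the origin is parallel to the direction.
off_ray = np.linalg.norm(np.cross(to_joint, directions[0]))
```

Because the rays live in world coordinates, a downstream model sees the same kind of input regardless of which camera produced it, which is one plausible reading of how the formulation supports deployment across arbitrary camera configurations.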
