What question did this study set out to answer?

This research aims to improve the estimation of 9-DoF object poses in challenging conditions without needing specific models.

March 8, 2026 · Open Access

ProM-Pose: Language-Guided Zero-Shot 9-DoF Object Pose Estimation from RGB-D with Generative 3D Priors

Key Points

Developed a unified cross-modal temporal perception framework.
Integrated language-conditioned generative 3D shape priors.
Utilized asymmetric cross-modal attention for enhanced spatial awareness.
Implemented a decoupled pose decoding strategy with temporal refinement.
Achieved mAP of 75.0% at 5°, 2 cm and 90.5% at 10°, 5 cm on the CAMERA25 benchmark.
Scored 42.2% at 5°, 2 cm and 76.0% at 10°, 5 cm on the REAL275 benchmark under zero-shot cross-domain evaluation.
Showed temporal stability and robustness under occlusion and lighting variations in qualitative tests on real-world logistics scenarios.
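
The paired thresholds in the results above (for example 5° with 2 cm) are conventionally read as: a predicted pose counts as correct only if its rotation error and its translation error both fall within the stated limits. The Python sketch below illustrates that check. It is not the paper's evaluation code: the function names are hypothetical, translations are assumed to be in meters, and object symmetries and the averaging that produces mAP are left out.

```python
import math

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # trace(R_pred^T @ R_gt) equals the element-wise product summed over all entries.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error_cm(t_pred, t_gt):
    """Euclidean distance between translations given in meters, returned in cm."""
    return 100.0 * math.dist(t_pred, t_gt)

def within_threshold(R_pred, t_pred, R_gt, t_gt, max_deg, max_cm):
    """True if both the rotation and the translation error are within the limits."""
    return (rotation_error_deg(R_pred, R_gt) <= max_deg
            and translation_error_cm(t_pred, t_gt) <= max_cm)

def rot_z(deg):
    """Demo helper (hypothetical): rotation about the z-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Made-up example: the prediction is 7 degrees and 3 cm away from the ground truth.
R_gt, t_gt = rot_z(0.0), (0.0, 0.0, 1.0)
R_pred, t_pred = rot_z(7.0), (0.0, 0.03, 1.0)

strict = within_threshold(R_pred, t_pred, R_gt, t_gt, 5.0, 2.0)   # 5 deg, 2 cm
loose = within_threshold(R_pred, t_pred, R_gt, t_gt, 10.0, 5.0)   # 10 deg, 5 cm
print(strict, loose)
```

In this made-up case the prediction fails the stricter threshold and passes the looser one, which is why scores at 10°, 5 cm are typically higher than at 5°, 2 cm.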

Abstract

Object pose estimation is fundamental for robotic manipulation, autonomous driving, and augmented reality, yet recovering the full 9-DoF state (rotation, translation, and anisotropic 3D scale) from RGB-D observations remains challenging for previously unseen objects. Existing methods rely on instance-specific CAD models or predefined category boundaries, or suffer from scale ambiguity under sparse observations. We propose ProM-Pose, a unified cross-modal temporal perception framework for zero-shot 9-DoF object pose estimation. By integrating language-conditioned generative 3D shape priors as canonical geometric references, an asymmetric cross-modal attention mechanism for spatially aware fusion, and a decoupled pose decoding strategy with temporal refinement, ProM-Pose constructs metrically consistent and semantically grounded representations without relying on category-specific pose priors or instance-level CAD supervision. Extensive experiments on the CAMERA25 and REAL275 benchmarks demonstrate that ProM-Pose achieves competitive or superior performance compared to category-level methods, with mAP of 75.0% at 5°, 2 cm and 90.5% at 10°, 5 cm on CAMERA25, and 42.2% at 5°, 2 cm and 76.0% at 10°, 5 cm on REAL275 under zero-shot cross-domain evaluation. Qualitative results on real-world logistics scenarios further validate temporal stability and robustness under occlusion and lighting variations. ProM-Pose effectively bridges semantic grounding and metric geometric reasoning within a unified formulation, enabling stable and scale-aware 9-DoF pose estimation for previously unseen objects under open-vocabulary conditions.
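
To make the "9-DoF" terminology concrete: the pose combines three rotational, three translational, and three per-axis scale parameters. The Python sketch below shows how such a pose could map points from a canonical object frame into the camera frame. The scale-then-rotate-then-translate ordering is a common convention assumed here for illustration, not a detail confirmed by the abstract, and `apply_pose` is a hypothetical name.

```python
def apply_pose(points, R, t, s):
    """Map canonical-frame points to the camera frame: p_cam = R @ (s * p) + t.

    R is a 3x3 rotation matrix (3 DoF), t a translation (3 DoF), and s a
    per-axis scale (3 DoF, anisotropic), giving 9 degrees of freedom in total.
    """
    out = []
    for p in points:
        # Anisotropic scaling: each axis has its own factor.
        scaled = [s[k] * p[k] for k in range(3)]
        # Rotate the scaled point.
        rotated = [sum(R[i][k] * scaled[k] for k in range(3)) for i in range(3)]
        # Translate into the camera frame.
        out.append(tuple(rotated[i] + t[i] for i in range(3)))
    return out

# Made-up example: 90-degree rotation about z, a shift along x, and unequal scales.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 0.0)
s = (2.0, 3.0, 4.0)

corner = apply_pose([(1.0, 1.0, 1.0)], R, t, s)[0]
print(corner)
```

Keeping the scale as three separate factors, rather than a single scalar, is what lets the pose describe objects whose proportions differ from the canonical reference shape.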

Cite This Study

Li et al. studied this question.

synapsesocial.com/papers/69ada9bbbc08abd80d5bcb76 https://doi.org/10.3390/electronics15051111
