What question did this study set out to answer?

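This work aims to improve category-level articulated object pose perception for embodied AI systems, covering both static pose estimation and dynamic pose tracking. It targets three limitations of existing methods: inadequate modeling of kinematic constraints, weak handling of self-occlusions, and difficulty meeting optimization requirements.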

April 17, 2026

Probing Effective and Efficient Category-Level Articulated Object Pose Perception

Key Points

Developed a joint-centric hierarchical model that explicitly embeds kinematic constraints.
Used an SE(3) manifold formulation for singularity-free rotation representation and stable optimization.
Employed a proxy canonicalization strategy for dynamic pose tracking, with a dynamic keyframe mechanism to suppress drift.
Showed state-of-the-art accuracy on synthetic, semi-synthetic, and real-world benchmarks.
Achieved real-time inference at 50 frames per second without post-processing.
Demonstrated improved robustness against self-occlusions and motion complexities.
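
To make the joint-centric idea in the key points concrete, here is a minimal Python sketch of how a constrained part's pose can be derived from a root part's pose and a joint state. It is an illustration under stated assumptions, not the CAPER++ implementation: the function names, the 4x4 homogeneous matrix convention, and the choice to express joint motion in the root part's frame are all hypothetical.

```python
import numpy as np

def axis_angle_rotation(axis, angle):
    """Rodrigues' formula: 3x3 rotation by `angle` (radians) about `axis`."""
    x, y, z = np.asarray(axis, float) / np.linalg.norm(axis)
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def revolute_joint(axis, pivot, angle):
    """4x4 transform rotating about the line through `pivot` along `axis`."""
    R = axis_angle_rotation(axis, angle)
    pivot = np.asarray(pivot, float)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = pivot - R @ pivot  # keeps every point on the joint axis fixed
    return T

def prismatic_joint(axis, displacement):
    """4x4 transform translating by `displacement` along `axis`."""
    axis = np.asarray(axis, float)
    T = np.eye(4)
    T[:3, 3] = displacement * axis / np.linalg.norm(axis)
    return T

def child_pose(T_root, T_joint):
    """Pose of a constrained part (assumed convention: root pose times joint motion)."""
    return T_root @ T_joint

# Hypothetical laptop-like object: the base is the root part, the lid is the
# constrained part, hinged along the x-axis through the point (0, 0.2, 0).
T_root = np.eye(4)
T_root[:3, 3] = [0.0, 0.0, 1.0]
T_joint = revolute_joint(axis=[1, 0, 0], pivot=[0, 0.2, 0], angle=np.pi / 2)
T_lid = child_pose(T_root, T_joint)
```

Because the lid's pose is computed from the root pose and a single joint angle, the two parts cannot drift apart at the hinge, which is the kind of geometric consistency that explicit kinematic constraints are meant to provide.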

Abstract

Category-level articulated object pose perception, encompassing both static pose estimation and dynamic pose tracking, is critical for embodied AI systems interacting with complex environments. Due to the inherent complexity and diverse motion structures of articulated objects, existing methods often fall short in modeling kinematic constraints, handling self-occlusions, and meeting optimization requirements. Building upon EfficientCAPER [1], this work introduces CAPER++, a unified framework that addresses these limitations through three key innovations. First, a joint-centric hierarchical model decomposes objects into a root part and constrained parts linked by joints, explicitly embedding kinematic constraints for geometrically consistent pose recovery. Second, an SE(3) manifold formulation leverages Lie algebra in the tangent space for singularity-free rotation representation and stable optimization, replacing error-prone direct regression. Third, for tracking, a proxy canonicalization strategy reformulates pose updates as SE(3) increment predictions relative to keyframes, enhanced by a dynamic keyframe mechanism to suppress drift. Extensive experiments on synthetic (ArtImage, PM-Videos), semi-synthetic (ReArtMix, ReArt-Videos), and real-world (RobotArm, RobotArm-Videos) benchmarks demonstrate state-of-the-art accuracy and robustness. CAPER++ achieves real-time inference (50 FPS) without post-processing, significantly advancing category-level articulated perception for real-world applications. Code and datasets are available at the project website: https://sites.google.com/view/caperplusplus.
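
The SE(3) increment idea in the abstract can be sketched in a few lines of Python. The code below maps a 6-vector in the tangent space (the Lie algebra se(3)) to a rigid transform with the standard exponential map, then applies it to a keyframe pose. It is a generic illustration, not the authors' network or code: the twist ordering (translation part first, rotation part second), the left-multiplication convention, and the names `se3_exp` and `track_step` are assumptions. The abstract does not describe the dynamic keyframe selection rule, so it is left out.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its 3x3 skew-symmetric matrix."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy],
                     [wz, 0.0, -wx],
                     [-wy, wx, 0.0]])

def se3_exp(xi):
    """Exponential map: twist xi = (rho, phi) in se(3) -> 4x4 pose in SE(3)."""
    xi = np.asarray(xi, float)
    rho, phi = xi[:3], xi[3:]
    theta = np.linalg.norm(phi)
    K = hat(phi)
    if theta < 1e-8:
        # Low-order Taylor expansion avoids division by zero near the identity.
        R = np.eye(3) + K + 0.5 * (K @ K)
        V = np.eye(3) + 0.5 * K + (K @ K) / 6.0
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + a * K + b * (K @ K)   # Rodrigues' rotation formula
        V = np.eye(3) + b * K + c * (K @ K)   # couples rotation and translation
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ rho
    return T

def track_step(T_key, xi):
    """Current pose = predicted increment applied to the keyframe pose
    (left-multiplication is an assumed convention)."""
    return se3_exp(xi) @ T_key

# Hypothetical example: the keyframe pose is a pure translation, and the
# predicted increment is a 90-degree rotation about the z-axis.
T_key = np.eye(4)
T_key[:3, 3] = [0.1, 0.0, 0.5]
xi = np.array([0.0, 0.0, 0.0, 0.0, 0.0, np.pi / 2])
T_now = track_step(T_key, xi)
```

Any 6-vector passed to `se3_exp` yields a valid rotation matrix by construction, so a model that predicts the twist never has to project its output back onto the rotation group. Directly regressed rotation matrices or Euler angles do not have that property.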
