Category-level articulated object pose perception, which encompasses both static pose estimation and dynamic pose tracking, is critical for embodied AI systems that interact with complex environments. Because articulated objects are inherently complex and have diverse motion structures, existing methods often fall short in modeling kinematic constraints, handling self-occlusions, and meeting optimization requirements. Building upon EfficientCAPER [1], this work introduces CAPER++, a unified framework that addresses these limitations through three key innovations. First, a joint-centric hierarchical model decomposes objects into a root part and constrained parts linked by joints, explicitly embedding kinematic constraints for geometrically consistent pose recovery. Second, an SE(3) manifold formulation leverages the Lie algebra in the tangent space for singularity-free rotation representation and stable optimization, replacing error-prone direct regression. Third, for tracking, a proxy canonicalization strategy reformulates pose updates as SE(3) increment predictions relative to keyframes, enhanced by a dynamic keyframe mechanism to suppress drift. Extensive experiments on synthetic (ArtImage, PM-Videos), semi-synthetic (ReArtMix, ReArt-Videos), and real-world (RobotArm, RobotArm-Videos) benchmarks demonstrate state-of-the-art accuracy and robustness. CAPER++ achieves real-time inference (50 FPS) without post-processing, significantly advancing category-level articulated perception for real-world applications. Code and datasets are available at the project website: https://sites.google.com/view/caperplusplus.
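To make the tangent-space idea concrete, the following is a minimal sketch of the standard SE(3) exponential map and of composing a predicted increment with a keyframe pose. It is not the CAPER++ implementation: the function names (`se3_exp`, `pose_from_keyframe`), the twist ordering (translation first, then rotation), and the left-multiplication convention are all assumptions made for illustration, and the dynamic keyframe mechanism is not shown.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def hat(w):
    """Map a 3-vector to its skew-symmetric (cross-product) matrix."""
    wx, wy, wz = w
    return [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]


def se3_exp(xi):
    """Exponential map from a twist xi = (rho, omega) to a 4x4 pose.

    Assumed ordering: xi[0:3] is the translational part rho,
    xi[3:6] is the axis-angle rotation vector omega.
    """
    rho, omega = list(xi[:3]), list(xi[3:])
    theta = math.sqrt(sum(w * w for w in omega))
    K = hat(omega)
    K2 = matmul(K, K)
    if theta < 1e-8:
        # Leading Taylor terms keep the map well defined near zero rotation.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Rodrigues formula for the rotation block.
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)]
         for i in range(3)]
    # V couples the rotation with the translational part of the twist.
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)]
         for i in range(3)]
    t = [sum(V[i][j] * rho[j] for j in range(3)) for i in range(3)]
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose_from_keyframe(T_key, xi):
    """Apply a predicted increment to a keyframe pose.

    Left-multiplication is an assumed convention, not taken from the paper.
    """
    return matmul(se3_exp(xi), T_key)


# Hypothetical usage: a keyframe pose and a small predicted increment.
T_key = se3_exp([0.1, 0.0, 0.0, 0.0, 0.0, 0.2])
T_now = pose_from_keyframe(T_key, [0.0, 0.0, 0.05, 0.0, 0.0, 0.1])
```

Because the increment is a 6-vector in the tangent space, any real-valued prediction maps to a valid rigid transform, which is the usual motivation for regressing increments instead of rotation matrices directly.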
Zhang et al. studied this question.