Abstract: With an enormous number of hand images generated over time, leveraging pose knowledge from unlabeled images for supervised hand mesh estimation is an emerging yet challenging topic. Semi-supervised and self-supervised approaches have been proposed to address this challenge, but they are limited by their reliance on high-quality fine-grained keypoint detection models or conventional ResNet backbones. In this paper, inspired by the rapid progress of Masked Image Modeling (MIM) and Vision Transformers (ViT) in visual classification tasks, we propose a novel self-supervised pre-training strategy for regressing 3D hand mesh parameters. Our approach uses a unified, multi-granularity strategy with a pseudo keypoint alignment module in a teacher-student framework to learn pose-aware semantic class tokens. For patch tokens with detailed locality, we adopt self-distillation between the teacher and student networks based on MIM pre-training. To better fit low-level regression tasks, we also incorporate masked pixel reconstruction for multi-level representation learning. Additionally, we design a strong pose estimation baseline that uses a vanilla Vision Transformer (ViT) as the backbone with a Pyramidal Mesh Alignment Feedback (PyMAF) head attached for mesh regression. Extensive experiments demonstrate that the proposed approach, named HandMIM, achieves state-of-the-art (SOTA) performance on various datasets. Notably, HandMIM outperforms specially optimized architectures, achieving 8.00 mm PAVPE (Procrustes Alignment Vertex-Point-Error) on the challenging HO3Dv2 test set and establishing a new state of the art in 3D hand mesh estimation.
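The abstract reports its headline result as a Procrustes-aligned vertex error. As a minimal sketch of how such a metric is commonly computed (not the authors' evaluation code), the NumPy snippet below removes the best-fitting scale, rotation, and translation from a predicted mesh before averaging per-vertex Euclidean distances. The function names `procrustes_align` and `pa_vertex_error` are hypothetical, and the inputs are assumed to be `(N, 3)` arrays of corresponding vertices in the same units (for example millimetres).

```python
import numpy as np


def procrustes_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Similarity-align pred (N, 3) onto gt (N, 3).

    Solves for the scale, rotation, and translation that minimise the
    sum of squared vertex distances, then returns the transformed pred.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g  # centre both point sets

    # SVD of the 3x3 cross-covariance gives the optimal rotation.
    u, s, vt = np.linalg.svd(p.T @ g)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    signs = np.array([1.0, 1.0, d])
    rot = (u * signs) @ vt  # row-vector convention: p @ rot ~ g

    # Optimal isotropic scale for the chosen rotation.
    scale = (s * signs).sum() / (p ** 2).sum()
    return scale * (p @ rot) + mu_g


def pa_vertex_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-vertex Euclidean distance after Procrustes alignment."""
    aligned = procrustes_align(pred, gt)
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```

Because the alignment absorbs global scale, rotation, and translation, a prediction that differs from the ground truth only by such a transform scores an error of essentially zero; the metric therefore isolates errors in articulated shape rather than in global placement.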
Nanchang University
Li et al. studied this question.