The reconstruction of hands and objects from monocular color images has attracted considerable attention in recent years. In existing methods, parametric models are constructed at a single scale, and the interaction between hands and objects has not been fully explored. As a result, the multiscale information in 2D images cannot be fully exploited. In addition, the lack of feature fusion and the insufficient use of labels substantially reduce reconstruction accuracy. To address these limitations, a new framework comprising three key modules is proposed. First, a multiscale feature extractor generates a multiscale feature representation that captures the interaction between hand and object more effectively. Second, an attention-based bridge connects the hand and object representations, which facilitates their integration. Third, a module based on token merging is introduced into the framework to provide a segmentation representation of the object. Experimental results on two datasets, Obman and DexYCB, show that the proposed method achieves a shape error of about 0.121 cm^2 on Obman and 0.40 cm^2 on DexYCB, outperforming state-of-the-art methods. This study may broaden the application prospects of human-computer interaction methods.
Zhang et al. (Wed,) studied this question.
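
As an illustration of the attention-based bridge mentioned in the abstract, the following is a minimal sketch of bidirectional cross-attention between hand and object tokens. It is not the authors' implementation: the abstract gives no architectural details, so the single-head form, the identity query/key/value projections, the residual connection, and all names (`cross_attention`, `hand_tokens`, `object_tokens`) are assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    Assumption: identity projections are used in place of learned
    query/key/value weights, so the sketch stays dependency-free.
    Returns the fused tokens (query + attended context) and the
    attention weights for each query token.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused, all_weights = [], []
    for q in queries:
        # Similarity of this query token to every key token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of value tokens (values equal keys here).
        context = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        # Residual connection keeps the original token information.
        fused.append([qi + ci for qi, ci in zip(q, context)])
        all_weights.append(weights)
    return fused, all_weights


# Hypothetical toy tokens: 2 hand tokens and 3 object tokens of dimension 2.
hand_tokens = [[1.0, 0.0], [0.0, 1.0]]
object_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# The bridge runs in both directions: hand attends to object and vice versa.
hand_fused, hand_to_object = cross_attention(hand_tokens, object_tokens)
object_fused, object_to_hand = cross_attention(object_tokens, hand_tokens)
```

Each fused hand token now carries a weighted summary of the object tokens, and each fused object token carries a summary of the hand tokens. A practical model would replace the identity projections with learned linear layers and typically use several attention heads.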