Abstract: The use of RGB-D image data in robotic tasks such as intelligent perception and pose estimation has recently attracted significant attention. A major challenge is making effective use of the complementary modalities in RGB-D data. This study proposes a 6D pose estimation method based on multimodal fusion, using a hybrid convolutional neural network (CNN) architecture that integrates point feature convolution with an attention mechanism. The method adopts a two-stage framework for segmentation and pose regression of RGB-D data, effectively extracting target features and enhancing the model's interpretability and robustness. A convolutional block attention module is introduced to recalibrate the feature maps dynamically, focusing on information-rich regions and channels and thereby improving overall performance. In extensive experiments on benchmark datasets, the proposed method achieved an average accuracy of 96.9% under the ADD metric on the LineMOD dataset, 94.6% under the ADD-S AUC metric on the YCB-Video dataset, and an average accuracy of 97.7% under the ADD-S < 2 cm criterion. These results highlight the superior performance of the proposed approach.
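The ADD and ADD-S metrics cited in the abstract can be illustrated with a minimal sketch. It follows the commonly used definitions: ADD averages the distance between corresponding model points under the ground-truth and predicted poses, while ADD-S averages the distance from each ground-truth point to its closest predicted point, which makes it tolerant of object symmetry. The function names, the pure-Python data layout, and the toy point set are assumptions for illustration only; they are not taken from the paper, whose evaluation code is not shown here.

```python
import math

def transform(points, R, t):
    """Apply a 3x3 rotation R (nested lists) and translation t to each 3D point."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding points under the two poses."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return sum(math.dist(a, b) for a, b in zip(gt, pred)) / len(points)

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its nearest predicted point."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return sum(min(math.dist(a, b) for b in pred) for a in gt) / len(points)

def is_correct(error, threshold):
    """A pose counts as correct when its error is below the chosen threshold."""
    return error < threshold

# Hypothetical toy model (metres): four points, symmetric under a 180-degree turn about z.
model = [(0.05, 0.0, 0.0), (-0.05, 0.0, 0.0), (0.0, 0.05, 0.0), (0.0, -0.05, 0.0)]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
half_turn_z = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]
zero = (0.0, 0.0, 0.0)

add_err = add_metric(model, identity, zero, half_turn_z, zero)     # about 0.1 m
add_s_err = add_s_metric(model, identity, zero, half_turn_z, zero)  # 0.0 m
print(add_err, add_s_err, is_correct(add_s_err, 0.02))
```

In this toy case the predicted pose is a symmetric equivalent of the true pose, so ADD reports a large error while ADD-S reports none. The 0.02 threshold mirrors the 2 cm criterion named in the abstract; for ADD on LineMOD, a threshold of 10% of the object diameter is the usual convention, and it is assumed rather than confirmed that the paper uses it.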
Jin et al. (Fri,) studied this question.