Abstract. Existing learning-based methods for robotic grasp detection often fail to account for object attributes or semantic functionality when determining grasp poses. To address this limitation, we introduce a two-stage grasp pose detection architecture that couples a deep learning model with a pre-trained vision-language foundation model. In the first stage, TransGNet, a dedicated neural network, produces initial grasp pose estimates. In the second stage, these coarse predictions are encoded as input prompts for the vision-language model, which refines the grasp configurations by reasoning about object attributes and task-oriented affordances. To adapt the vision-language model to grasp-oriented reasoning, we compile and publish a task-specific fine-tuning dataset for robotic manipulation scenarios. On the Cornell benchmark dataset, TransGNet surpasses previous approaches, attaining a state-of-the-art accuracy of 99.3%. Furthermore, with the vision-language model’s semantic reasoning integrated, our framework consistently predicts more functionally appropriate grasps. The results validate the robustness and practical applicability of our approach.
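The abstract does not specify how the coarse predictions are encoded as prompts. As a minimal sketch only, assuming each candidate is a planar grasp (centre, angle, gripper width, confidence) and that the vision-language model accepts a plain-text prompt alongside the image, the encoding step might look like the following. The `Grasp` fields, the `build_prompt` helper, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Grasp:
    """One coarse grasp candidate in image coordinates (hypothetical layout)."""
    x: float          # grasp centre, pixels
    y: float          # grasp centre, pixels
    angle_deg: float  # gripper rotation relative to the image x-axis
    width: float      # gripper opening, pixels
    score: float      # confidence assigned by the first-stage network


def build_prompt(grasps: List[Grasp], task: str, top_k: int = 3) -> str:
    """Serialize the top-k candidates into a text prompt (assumed format)."""
    # Keep only the highest-scoring candidates so the prompt stays short.
    ranked = sorted(grasps, key=lambda g: g.score, reverse=True)[:top_k]
    lines = [
        f"Task: {task}",
        "Candidate grasps (centre x, y in pixels; angle in degrees; width in pixels):",
    ]
    for i, g in enumerate(ranked, start=1):
        lines.append(
            f"{i}. x={g.x:.0f}, y={g.y:.0f}, angle={g.angle_deg:.1f}, "
            f"width={g.width:.0f}, score={g.score:.2f}"
        )
    lines.append(
        "Considering the object's attributes and the task, "
        "reply with the number of the most appropriate grasp."
    )
    return "\n".join(lines)
```

A caller would pass the first-stage candidates and a task description, for example `build_prompt(candidates, "hand over the knife")`, and send the resulting string to the vision-language model together with the scene image. Numbering the candidates lets the model answer with an index rather than raw coordinates, which is one possible design choice rather than the authors' stated one.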
Wei Jia
Qingni Yuan
Mingshan Xie
Journal of Physics: Conference Series
Jia et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68af5210ad7bf08b1ead96f9 — DOI: https://doi.org/10.1088/1742-6596/3077/1/012001