Abstract. In robotic grasp detection, existing learning-based methods often determine grasp poses without accounting for object attributes or semantic functionality. To address this limitation, we introduce a two-stage grasp pose detection architecture that couples a deep learning model with a pre-trained vision-language foundation model. In the first stage, we present TransGNet, a dedicated neural network for initial grasp pose estimation. In the second stage, the coarse predictions from TransGNet are encoded as input prompts for the vision-language model, which refines the grasp configurations by reasoning about object attributes and task-oriented affordances. To improve the vision-language model’s adaptability to grasp-oriented reasoning, we compile and publish a task-specific fine-tuning dataset designed for robotic manipulation scenarios. Comprehensive evaluations on the Cornell benchmark dataset show that TransGNet surpasses previous approaches, attaining a state-of-the-art accuracy of 99.3%. Furthermore, by integrating the vision-language model’s semantic reasoning, our framework consistently predicts more functionally appropriate grasps. These results support the robustness and practical applicability of our approach.
Jia et al. (Fri,) studied this question.
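To make the hand-off between the two stages concrete, the following is a minimal sketch of how coarse grasp predictions might be serialized into a text prompt for a vision-language model. The abstract does not specify the prompt format or the grasp parameterization, so everything here is an assumption for illustration: the `GraspCandidate` fields (a planar grasp with center, angle, width, and score), the `build_refinement_prompt` helper, and the wording of the prompt are all hypothetical and are not taken from TransGNet.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GraspCandidate:
    """Hypothetical planar grasp: center in pixels, rotation, opening width, score."""
    x: float
    y: float
    angle_deg: float
    width: float
    score: float


def build_refinement_prompt(
    candidates: List[GraspCandidate], task: str, top_k: int = 3
) -> str:
    """Serialize the top-k coarse grasps into a text prompt.

    The prompt layout is an assumption, not the format used in the paper.
    """
    # Keep only the highest-scoring candidates so the prompt stays short.
    ranked = sorted(candidates, key=lambda g: g.score, reverse=True)[:top_k]
    lines = [
        f"{i}: center=({g.x:.0f}, {g.y:.0f}), angle={g.angle_deg:.1f} deg, "
        f"width={g.width:.0f} px, score={g.score:.2f}"
        for i, g in enumerate(ranked, start=1)
    ]
    return (
        f"Task: {task}\n"
        "Candidate grasps from the first-stage detector:\n"
        + "\n".join(lines)
        + "\nSelect the candidate best suited to the object's attributes "
        "and the task, and answer with its index."
    )


if __name__ == "__main__":
    coarse = [
        GraspCandidate(120, 80, 15.0, 40, 0.72),
        GraspCandidate(200, 150, -30.0, 35, 0.91),
    ]
    print(build_refinement_prompt(coarse, "hand over the knife"))
```

In a full pipeline, the returned string would be sent to the vision-language model together with the image, and the model's chosen index would select the refined grasp. Ranking by score and truncating to `top_k` is one simple design choice for keeping the prompt compact; other selection strategies are equally plausible.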