August 1, 2025 · Open Access

Detection of Robot Optimal Grasping Pose Based on Vision-Language Models

Key Points

TransGNet attains 99.3% accuracy on the Cornell benchmark dataset, surpassing previous grasp detection approaches.
TransGNet provides initial grasp pose estimates, which a pre-trained vision-language model then refines.
A newly compiled, task-specific fine-tuning dataset improves the vision-language model's adaptability to grasp-oriented reasoning.
The framework highlights the importance of semantic understanding in achieving functionally appropriate grasps.

Abstract

In robotic grasp detection, existing learning-based methods often fail to account for object attributes or semantic functionality when determining grasp poses. To address this limitation, we propose a two-stage grasp pose detection architecture that couples a deep learning model with a pre-trained vision-language foundation model. In the first stage, we introduce TransGNet, a dedicated neural network for initial grasp pose estimation. In the second stage, the coarse predictions from TransGNet are encoded as input prompts for the vision-language model, which refines the grasp configurations by reasoning about object attributes and task-oriented affordances. To improve the vision-language model’s adaptability to grasp-oriented reasoning, we compile and publish a task-specific fine-tuning dataset designed for robotic manipulation scenarios. Comprehensive evaluations on the Cornell benchmark dataset show that TransGNet surpasses previous approaches, attaining a state-of-the-art accuracy of 99.3%. Furthermore, by integrating the vision-language model’s semantic reasoning, our framework consistently predicts more functionally appropriate grasps. The results validate the robustness and practical applicability of our approach.
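The abstract does not specify how coarse grasp predictions are encoded as prompts, so the following is only a minimal sketch of one plausible encoding, not the authors' implementation. The `GraspCandidate` fields, the prompt wording, the `build_refinement_prompt` and `parse_choice` helpers, and the knife scenario are all hypothetical, and no real vision-language model API is called.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GraspCandidate:
    """Hypothetical planar grasp: center (px), rotation (deg), opening width (px)."""
    x: float
    y: float
    angle_deg: float
    width: float
    score: float  # assumed stage-one confidence in [0, 1]


def build_refinement_prompt(candidates, task, top_k=3):
    """Serialize the top-k coarse grasps as text (the prompt format is an assumption)."""
    ranked = sorted(candidates, key=lambda g: g.score, reverse=True)[:top_k]
    lines = [f"Task: {task}", "Candidate grasps (image pixels, degrees):"]
    for i, g in enumerate(ranked, start=1):
        lines.append(
            f"{i}. center=({g.x:.0f}, {g.y:.0f}), angle={g.angle_deg:.0f}, "
            f"width={g.width:.0f}, score={g.score:.2f}"
        )
    lines.append("Reply with the number of the most functionally appropriate grasp.")
    return "\n".join(lines), ranked


def parse_choice(reply, ranked):
    """Return the grasp named in the reply; fall back to the top-scored candidate."""
    match = re.search(r"\d+", reply)
    if match:
        index = int(match.group()) - 1
        if 0 <= index < len(ranked):
            return ranked[index]
    return ranked[0]


# Illustrative values only: two coarse candidates on a hypothetical knife image.
candidates = [
    GraspCandidate(120, 80, 15, 40, 0.91),  # e.g. near the blade
    GraspCandidate(210, 95, 90, 35, 0.88),  # e.g. near the handle
]
prompt, ranked = build_refinement_prompt(candidates, task="hand the knife to a person")
chosen = parse_choice("2", ranked)  # "2" stands in for a model reply
```

In this sketch the language model only selects among stage-one candidates. The falling back to the highest-scored candidate keeps the pipeline usable when the reply cannot be parsed. A refinement step that adjusts the pose parameters themselves would need a richer reply format.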
