In recent years, rapid advances in artificial intelligence have driven a new round of transformation in robotics. Data-driven methods, represented by deep learning, use their strong feature extraction capabilities to overcome limitations of traditional robotic grasp detection, such as reliance on hand-designed geometric features and poor generalization in complex environments. By training end to end on multimodal perceptual data with approaches such as convolutional neural networks, Transformer architectures, and reinforcement learning, robotic systems can now recognize, localize, and plan grasps for target objects of different shapes. However, existing grasp detection methods rely on structurally complex networks to improve performance, which results in large parameter counts and low runtime efficiency and also hinders porting and deployment to low-power devices. Meanwhile, as networks grow deeper, single-modal object detection and robotic grasping expose a further limitation: single-modal data cannot capture a sufficiently complete set of features, which limits grasp performance. To address the problem that data from different modalities cannot be fully aligned in robotic grasping tasks, an intelligent dual-modal alignment Transformer deep network (SATNet) is designed for dual-modal input consisting of RGB images and depth images. The network achieves high-precision grasping with a lightweight architecture (model size: 0.27M). Experiments on the Cornell dataset show an inference time of 16.3 ms and a grasp accuracy of 97.8%, delivering strong performance at a very low computational cost.
Jia et al. (Fri,) studied this question.