What question did this study set out to answer?

The aim is to develop a lightweight grasp detection network that uses dual-modal input (RGB and depth images) to improve robotic grasping performance.

February 16, 2026

Design of a Lightweight Grasp Detection Network for Industrial Humanoid Robots Based on Dual-Modal Fusion Transformer

Key Points

Aimed to build a lightweight grasp detection network whose dual-modal input improves robotic grasping.
Designed a dual-modal alignment Transformer deep network (SATNet) for RGB and depth images.
Employed end-to-end training with multimodal perceptual data.
Validated performance using the Cornell dataset.
Achieved a grasp accuracy of 97.8%.
Attained an inference time of 16.3 ms.
Reduced the model size to 0.27M, enabling efficient deployment on low-power devices.
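
To put the reported figures in context, the short calculation below converts the 16.3 ms inference time into a frame rate and estimates the storage needed for the weights. It assumes that "0.27M" is a parameter count and that each parameter is stored as a 32-bit float. The source states neither, and the variable names are illustrative only.

```python
# Figures reported in the study summary.
inference_time_ms = 16.3
model_size_millions = 0.27  # ASSUMPTION: a parameter count (not stated in the source)

# ASSUMPTION: 32-bit floating-point weights, 4 bytes each.
BYTES_PER_PARAM = 4

# Single-image throughput implied by the reported latency.
frames_per_second = 1000.0 / inference_time_ms

# Approximate weight storage in megabytes (1 MB = 10**6 bytes).
weights_mb = model_size_millions * 1e6 * BYTES_PER_PARAM / 1e6

print(f"Throughput: {frames_per_second:.1f} frames/s")
print(f"Weights: {weights_mb:.2f} MB")
```

Under these assumptions the model would process about 61 images per second and occupy roughly 1.08 MB, which is consistent with the claim that it suits resource-constrained hardware.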

Abstract

In recent years, rapid advances in artificial intelligence have been transforming the field of robotics. Data-driven methods, with deep learning as the leading example, offer powerful feature extraction and have overcome long-standing problems in traditional robotic grasp detection, such as reliance on manually designed geometric features and poor generalization in complex environments. Through end-to-end training on multimodal perceptual data with approaches such as convolutional neural networks, Transformer architectures, and reinforcement learning, robotic systems can now recognize, localize, and plan grasps for target objects of different shapes. However, existing grasp detection methods rely on structurally complex networks to improve performance. The resulting grasp networks have large parameter counts and low operational efficiency, and they are difficult to port and deploy on low-power devices. Meanwhile, single-modal object detection and robotic grasping expose a further limitation as networks grow deeper: data from a single modality cannot capture sufficiently complete feature information, which limits grasp performance. To address the problem that data from different modalities cannot be fully aligned in robotic grasp tasks, an intelligent dual-modal alignment Transformer deep network (SATNet) is designed for dual-modal input consisting of RGB images and depth images. The network achieves high-precision grasping with a lightweight architecture (model size: 0.27M). Experimental validation on the Cornell dataset yielded an inference time of 16.3 ms and a grasp accuracy of 97.8%, delivering strong performance at an extremely low computational cost.
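
The abstract does not describe SATNet's internal layers, so the following is only a generic sketch of one common way to combine two modalities in a Transformer: cross-attention, in which tokens from one modality (here RGB) query tokens from the other (here depth). It is not the authors' architecture. The function names, the toy feature values, and the omission of learned projection matrices are all assumptions made to keep the illustration short.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        weights.append(w)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

def fuse_rgb_depth(rgb_tokens, depth_tokens):
    """Hypothetical fusion step: RGB tokens query depth tokens, plus a residual."""
    attended, weights = cross_attention(rgb_tokens, depth_tokens, depth_tokens)
    fused = [[r + a for r, a in zip(rt, at)]
             for rt, at in zip(rgb_tokens, attended)]
    return fused, weights

# Toy features: 3 RGB tokens and 2 depth tokens, each 4-dimensional.
rgb_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
depth_tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]

fused_tokens, attention = fuse_rgb_depth(rgb_tokens, depth_tokens)
print(len(fused_tokens), len(fused_tokens[0]))  # 3 4
```

Each row of `attention` sums to 1 and indicates how strongly an RGB token draws on each depth token. A real network would add learned query, key, and value projections, multiple heads, and normalization.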
