What does this research mean for the field?

Combining YOLOv11 with Oriented Bounding Boxes (YOLOv11-OBB) and depth data enables highly accurate, real-time, and identity-aware robot grasping and classification suitable for lightweight embedded systems. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

To develop an efficient robot grasping system that integrates RGB image analysis and depth-based localization.

May 28, 2026 · Open Access

Object Detection Using YOLO Oriented Bounding Box For Robot Grasping Applications

Key Points

Goal: an efficient robot grasping system that integrates RGB image analysis with depth-based localization.
Uses YOLOv11 with Oriented Bounding Boxes (YOLOv11-OBB) to predict object pose from RGB images.
Combines RGB-based grasp detection with depth data from an Intel RealSense D435 camera for 3D localization.
Evaluates two models: a grasp-only model trained on a combination of Cornell and custom real-world datasets, and a grasp+classification model for selective manipulation of multiple object types.
The grasp-only model achieved 99.5% mAP@0.5, 94.0% mAP@0.5:0.95, and 99.4% precision at an IoU threshold of 0.6, with an inference time of 29 ms under the tested hardware setting.
The grasp+classification model demonstrated over 97% grasp success with only 619 training images, indicating robustness despite limited data.
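
The IoU thresholds in the metrics above compare a predicted oriented box with a ground-truth one. The sketch below is one way to compute that overlap for two rotated rectangles, using Sutherland-Hodgman polygon clipping. It is an illustration only, not the paper's evaluation code: the box layout `(cx, cy, w, h, theta)` with `theta` in radians and the function names are assumptions made here.

```python
import math

def obb_corners(cx, cy, w, h, theta):
    """Corners of a rotated rectangle, counter-clockwise in a y-up frame."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]

def polygon_area(pts):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    pairs = zip(pts, pts[1:] + pts[:1])
    return 0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)

def clip(subject, clipper):
    """Sutherland-Hodgman: clip `subject` against a convex CCW `clipper`."""
    output = subject
    for (ax, ay), (bx, by) in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        inp, output = output, []

        def side(p):
            # >= 0 means p lies on or to the left of the edge a -> b (inside).
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        for p, q in zip(inp, inp[1:] + inp[:1]):
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # where segment p -> q crosses the edge line
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output

def obb_iou(a, b):
    """IoU of two oriented boxes given as (cx, cy, w, h, theta)."""
    inter_poly = clip(obb_corners(*a), obb_corners(*b))
    inter = abs(polygon_area(inter_poly)) if len(inter_poly) >= 3 else 0.0
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# Hypothetical boxes: same size and angle, shifted by half a box width.
pred = (2.0, 0.0, 4.0, 2.0, 0.0)
truth = (0.0, 0.0, 4.0, 2.0, 0.0)
iou = obb_iou(pred, truth)   # overlap 4 over union 12
is_match = iou >= 0.6        # 0.6 is the threshold quoted for precision
```

A detection counts toward precision at a given threshold only when its IoU with a ground-truth box reaches that threshold, so the half-shifted box in this example would not count as a match at 0.6.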

Abstract

Industrial grasping requires accurate pose estimation and identity-aware selection, yet most deep-learning grasp detectors are object-agnostic and computationally heavy, which limits goal-directed manipulation and deployment on lightweight embedded systems. This paper presents a robot grasping system that combines RGB-based grasp detection and depth-based 3D localization with low-cost robot control. We use YOLOv11 with Oriented Bounding Boxes (YOLOv11-OBB) to simultaneously predict object pose and class from RGB images. These detections are combined with depth data from an Intel RealSense D435 RGB-D camera to compute a 3D grasping pose. A 4-DOF robot arm controlled via a PLC performs pick-and-place operations based on the estimated poses. The paper evaluates two scenarios: a grasping-only model trained on a combination of Cornell and custom real-world datasets, and a grasping-and-classification model that allows selective manipulation of multiple object types. Experimental results show that the grasp-only model achieves 99.5% mAP@0.5, 94.0% mAP@0.5:0.95, and 99.4% precision at an IoU threshold of 0.6, while maintaining an inference time of 29 ms under the tested hardware setting. Compared with several representative grasp detection methods, the proposed approach achieves competitive accuracy and real-time performance. The grasp+classification model achieves over 97% grasp success across various object types with only 619 training images, indicating good performance under the tested experimental conditions despite the limited dataset size.
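
The abstract does not spell out how a 2D detection and a depth reading become a 3D grasp position. A common approach, sketched here as an assumption rather than as the authors' method, is pinhole back-projection of the box center using the camera intrinsics. The `Intrinsics` class, the `deproject` function, and all numeric values are hypothetical. The sketch ignores lens distortion and assumes the depth frame is already aligned to the color frame.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera parameters, in pixels (hypothetical container)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point x
    cy: float  # principal point y

def deproject(u, v, depth_m, k):
    """Map pixel (u, v) with depth in meters to a 3D point in the camera frame."""
    x = (u - k.cx) / k.fx * depth_m
    y = (v - k.cy) / k.fy * depth_m
    return (x, y, depth_m)

# Placeholder intrinsics; real values come from the camera's calibration.
k = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)

# Center of a detected oriented box (pixels) and the depth read at that pixel.
u, v, depth_m = 380.0, 240.0, 0.5
point = deproject(u, v, depth_m, k)  # 60 px right of center at 0.5 m -> x = 0.05 m
```

The resulting point is expressed in the camera frame. Driving a robot arm with it would additionally require a camera-to-robot transform, which this summary does not describe.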
