March 3, 2026 | Open Access

Prompt-Based Object Detection and Task Planning for Automated Grasping Using Large Language Models

Key Points

The framework handles overlapping objects robustly.
The framework demonstrates effective task planning based on voice instructions and object recognition.
The No-Label Detection System (NLDS) enables automated detection and extraction of previously unseen objects.
NLDS combines YOLOv8 for coordinate detection with GPT-4o for semantic labeling.

Abstract

Task planning for robots often involves defining a wide range of possible actions, which makes the process both time-consuming and labor-intensive. This study presents an automated task-planning framework for teleoperated grasping robots that leverages large language models (LLMs). Task instructions are provided via voice input, while object detection is performed using the No-Label Detection System (NLDS), which integrates YOLOv8 for coordinate detection and GPT-4o for semantic labeling. This configuration allows the system to flexibly recognize previously unseen objects and align visual outputs with natural language commands. The proposed framework comprises three main stages: (1) task planning based on operator instructions, (2) object detection and extraction, and (3) grasp position estimation. Experiments conducted on a physical robotic system demonstrate the framework’s capability to interpret ambiguous commands and manage overlapping objects, achieving robust performance in complex, real-world scenarios.
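
To make the three stages concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is not the authors' implementation. Every name in it (`nlds_detect`, `grasp_point`, `run_pipeline`, and the `fake_*` stubs) is hypothetical, the planner and labeler callables merely stand in for the LLM calls, the box detector stands in for YOLOv8, and taking the bounding-box centre as the grasp position is an assumption made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Detection:
    box: Box
    label: str


def nlds_detect(image, box_detector: Callable, labeler: Callable) -> List[Detection]:
    """Pair each detected box with a free-text label (hypothetical NLDS stand-in)."""
    return [Detection(box, labeler(image, box)) for box in box_detector(image)]


def grasp_point(box: Box) -> Tuple[float, float]:
    """Placeholder grasp estimate: the centre of the bounding box (an assumption)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def run_pipeline(instruction: str, image, planner: Callable,
                 box_detector: Callable, labeler: Callable
                 ) -> Optional[Tuple[str, Tuple[float, float]]]:
    # Stage 1: task planning, turning the operator instruction into a target name.
    target_name = planner(instruction).lower()
    # Stage 2: object detection and extraction of the matching object.
    detections = nlds_detect(image, box_detector, labeler)
    matches = [d for d in detections if d.label.lower() == target_name]
    if not matches:
        return None
    # Stage 3: grasp position estimation for the first matching object.
    return matches[0].label, grasp_point(matches[0].box)


# Stubs so the sketch runs without a camera, a detector model, or an LLM.
def fake_planner(instruction: str) -> str:
    # Toy rule: treat the last word of the instruction as the target object.
    return instruction.rstrip(".").split()[-1]


def fake_box_detector(image) -> List[Box]:
    return [(0.0, 0.0, 10.0, 20.0), (30.0, 40.0, 50.0, 80.0)]


def fake_labeler(image, box: Box) -> str:
    labels = {(0.0, 0.0, 10.0, 20.0): "mug", (30.0, 40.0, 50.0, 80.0): "screwdriver"}
    return labels[box]


result = run_pipeline("Please pick up the screwdriver", None,
                      fake_planner, fake_box_detector, fake_labeler)
print(result)  # ('screwdriver', (40.0, 60.0))
```

Passing the planner, detector, and labeler in as callables keeps the sketch independent of any particular model API; in a real system each stub would be replaced by the corresponding speech, vision, or language component.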

Cite This Study

Imaizumi et al. studied this question.

synapsesocial.com/papers/69a7677ebadf0bb9e87e11e2 https://doi.org/10.1541/ieejjia.20250110
