Task planning for robots often requires defining a wide range of possible actions, which makes the process both time-consuming and labor-intensive. This study presents an automated task-planning framework for teleoperated grasping robots that leverages large language models (LLMs). Task instructions are provided via voice input, and object detection is performed by the No-Label Detection System (NLDS), which combines YOLOv8 for coordinate detection with GPT-4o for semantic labeling. This configuration allows the system to recognize previously unseen objects and to align visual outputs with natural language commands. The proposed framework comprises three main stages: (1) task planning based on operator instructions, (2) object detection and extraction, and (3) grasp position estimation. Experiments conducted on a physical robotic system demonstrate the framework’s capability to interpret ambiguous commands and handle overlapping objects in complex, real-world scenarios.
Imaizumi et al. (Thu,) studied this question.
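
The three stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `LabeledObject`, `label_detections`, `select_target`, `grasp_point`, and `plan_grasp` are hypothetical, the detector and labeler are replaced by plain Python stand-ins rather than real YOLOv8 or GPT-4o calls, keyword matching stands in for LLM-based task planning, and the box center is only a placeholder for grasp position estimation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Bounding box as (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class LabeledObject:
    box: Box
    label: str


def label_detections(boxes: List[Box],
                     labeler: Callable[[Box], str]) -> List[LabeledObject]:
    """Attach a semantic label to each unlabeled box.

    Assumption: `boxes` would come from a detector and `labeler` would wrap
    a vision-language model; both are stand-ins here.
    """
    return [LabeledObject(box=b, label=labeler(b)) for b in boxes]


def select_target(objects: List[LabeledObject],
                  instruction: str) -> Optional[LabeledObject]:
    """Return the first object whose label occurs in the instruction.

    Assumption: naive keyword matching replaces LLM-based planning.
    """
    text = instruction.lower()
    for obj in objects:
        if obj.label.lower() in text:
            return obj
    return None


def grasp_point(box: Box) -> Tuple[float, float]:
    """Use the box center as a placeholder grasp position."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def plan_grasp(instruction: str,
               boxes: List[Box],
               labeler: Callable[[Box], str]) -> Optional[Tuple[float, float]]:
    """Run the three stages in order; return None if no object matches."""
    objects = label_detections(boxes, labeler)
    target = select_target(objects, instruction)
    return None if target is None else grasp_point(target.box)


# Usage with made-up boxes and labels (illustrative values only).
demo_boxes: List[Box] = [(10, 20, 50, 60), (100, 100, 140, 180)]
demo_labels = {demo_boxes[0]: "cup", demo_boxes[1]: "bottle"}
demo_result = plan_grasp("Please pick up the bottle",
                         demo_boxes, lambda b: demo_labels[b])
print(demo_result)  # (120.0, 140.0)
```

Passing the labeler in as a callable keeps the detection and labeling components separable, which mirrors the split between coordinate detection and semantic labeling described above.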