What question did this study set out to answer?

This work aims to develop an embodied AI framework that allows non-programming users to configure robotic tasks using natural language.

February 21, 2026 | Open Access

Embodied AI Agent Framework for (Re)Programming Robotic Tasks in Flexible Assembly using LLMs and VLMs

Key Points

Developed a modular AI agent framework integrating LLMs and VLMs with ROS2.
Enabled task reconfiguration via natural language inputs (text and voice).
Evaluated in four real-world scenarios: motion control, cable grasping, visual feedback, and task-cycle recording.
Achieved 100% command parsing accuracy for basic motions.
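Achieved a 70% success rate in dual-cable manipulation, with failures mainly from trajectory planning or self-collisions.
Visual scene feedback was reliable, but VLM inference was near real-time only on high-end GPUs and slowed significantly on consumer devices.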

Abstract

The increasing product variability and skilled labor shortages in manufacturing intensify the need for more flexible and adaptive automation solutions, particularly in assembly systems. However, existing robotic automation typically requires expert-level programming, preventing adaptability to new tasks or product variants. This work presents a modular embodied AI agent framework that enables non-programming operators to implement and reconfigure robotic tasks through intuitive natural language (NL) commands in text or voice format. By integrating open-source Large Language Models (LLMs) and Visual Language Models (VLMs) with a Robot Operating System 2 (ROS2)-based stack, the agent translates user instructions into perception, grasp generation, and motion planning actions. The framework is evaluated in real-world experiments across four scenarios: low-level motion control, vision-guided cable grasping, visual scene feedback, and task-cycle recording with replay for scalable execution. Results show 100% command parsing accuracy for basic motions, reliable visual feedback, and 70% success in dual-cable manipulation, with failures mainly from trajectory planning or self-collisions. VLM inference latency proved highly hardware-dependent, with near real-time performance on high-end GPUs but significant slowdowns on consumer devices, highlighting deployment challenges. These findings show the potential of embodied AI agents to bridge NL interaction with robotic execution, lowering barriers to deploying adaptive, operator-friendly robotic systems in dynamic manufacturing.
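
To make the instruction-to-action idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the language model emits a JSON list of steps, which the abstract does not specify. The action names (`detect_object`, `generate_grasp`, `move_to`), the `Action` type, and the handler functions are all hypothetical stand-ins for the perception, grasp generation, and motion planning components; a real system would call into the ROS2 stack instead of returning strings. Re-executing the stored plan illustrates, in a simplified way, recording a task cycle and replaying it without querying the model again.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    """One validated step of a task plan (hypothetical structure)."""
    name: str
    params: dict = field(default_factory=dict)


# Hypothetical stand-ins for perception, grasp generation, and motion planning.
# Each returns a log string instead of commanding a robot.
def detect_object(params: dict) -> str:
    return f"detect {params['target']}"


def generate_grasp(params: dict) -> str:
    return f"grasp {params['target']}"


def move_to(params: dict) -> str:
    return f"move to {params['pose']}"


# Whitelist of actions the agent is allowed to trigger.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "detect_object": detect_object,
    "generate_grasp": generate_grasp,
    "move_to": move_to,
}


def parse_plan(llm_output: str) -> List[Action]:
    """Parse the model's JSON output and reject any unknown action name."""
    plan = []
    for step in json.loads(llm_output):
        name = step["action"]
        if name not in HANDLERS:
            raise ValueError(f"unknown action: {name!r}")
        plan.append(Action(name, step.get("params", {})))
    return plan


def execute(plan: List[Action]) -> List[str]:
    """Run each step through its handler, in order."""
    return [HANDLERS[a.name](a.params) for a in plan]


# Assumed example of what a model might emit for "pick up the red cable".
llm_output = """[
  {"action": "detect_object", "params": {"target": "red cable"}},
  {"action": "generate_grasp", "params": {"target": "red cable"}},
  {"action": "move_to", "params": {"pose": "grasp_pose"}}
]"""

plan = parse_plan(llm_output)   # the recorded task cycle
first_run = execute(plan)
replay = execute(plan)          # replayed without another model call
```

Validating action names against a fixed whitelist before execution is one simple way to keep free-form model output from reaching the robot unchecked; whether the framework described here does this is not stated in the abstract.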
