Increasing product variability and skilled-labor shortages in manufacturing intensify the need for flexible and adaptive automation, particularly in assembly systems. However, existing robotic automation typically requires expert-level programming, which limits adaptability to new tasks or product variants. This work presents a modular embodied AI agent framework that enables operators without programming expertise to implement and reconfigure robotic tasks through intuitive natural language (NL) commands given as text or voice. By integrating open-source Large Language Models (LLMs) and Visual Language Models (VLMs) with a Robot Operating System 2 (ROS2)-based stack, the agent translates user instructions into perception, grasp generation, and motion planning actions. The framework is evaluated in real-world experiments across four scenarios: low-level motion control, vision-guided cable grasping, visual scene feedback, and task-cycle recording with replay for scalable execution. Results show 100% command parsing accuracy for basic motions, reliable visual feedback, and a 70% success rate in dual-cable manipulation, with failures arising mainly from trajectory planning or self-collisions. VLM inference latency proved highly hardware-dependent, with near real-time performance on high-end GPUs but significant slowdowns on consumer devices, which highlights deployment challenges. These findings show the potential of embodied AI agents to bridge NL interaction and robotic execution, lowering the barriers to deploying adaptive, operator-friendly robotic systems in dynamic manufacturing.
Souza et al. (Thu,) studied this question.