October 20, 2025 | Open Access

From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems

Key Points

Performance limits expose trade-offs in generalization and data efficiency during robotic task execution.
Two focused case studies highlight distinct paradigms: vision-language-action models and modular pipelines.
Fine-grained instruction understanding is assessed through complex instruction grounding tasks.
Integrating foundation models into robotic systems presents emerging challenges and opportunities in real-world applications.

Abstract

Foundation models (FMs) are increasingly used to bridge language and action in embodied agents, yet the operational characteristics of different FM integration strategies remain under-explored, particularly for complex instruction following and versatile action generation in changing environments. This paper examines three paradigms for building robotic systems: end-to-end vision-language-action (VLA) models that implicitly integrate perception and planning, and modular pipelines incorporating either vision-language models (VLMs) or multimodal large language models (LLMs). We evaluate these paradigms through two focused case studies: a complex instruction grounding task assessing fine-grained instruction understanding and cross-modal disambiguation, and an object manipulation task targeting skill transfer via VLA finetuning. Our experiments in zero-shot and few-shot settings reveal trade-offs in generalization and data efficiency. By exploring performance limits, we distill design implications for developing language-driven physical agents and outline emerging challenges and opportunities for FM-powered robotics in real-world conditions.


Cite This Study

Sui et al. studied this question.

synapsesocial.com/papers/68f5c338e2d8b12842645baa https://doi.org/10.48550/arxiv.2505.15685
