June 19, 2024 · Open Access

PlanAgent: Embodied Visual-Language Model for Grounded Task Planning with Environment Map


Abstract

Embodied intelligence refers to an agent that interacts with its environment by perceiving, planning, making decisions, and executing actions as humans do. It is applicable to smart homes, drone inspection, and other domains. Embodied task planning is one of its core tasks: the agent must generate a detailed step-by-step plan while perceiving the surrounding environment and understanding a language instruction. Visual-language models, with their powerful multimodal representation capabilities, have been generalized to a wide range of tasks, but applying them to embodied task planning still faces two challenges. First, the complexity of the environment makes it difficult to model global environment information. Second, frequent turns along task paths make planning dependent on strong spatial reasoning ability. To overcome these challenges, we propose PlanAgent, the first embodied visual-language model for embodied task planning. Specifically, an environment map is employed to model the global environment information, and an environment map encoder extracts task-related information from it. Further, to reduce the dependence of task path planning on strong spatial reasoning, we introduce a self-posture-aware training strategy that breaks long-term spatial reasoning down into short-term reasoning. We build the EmbodiedPlan-20k dataset for grounded planning in embodied tasks. Our experiments on this dataset demonstrate that PlanAgent outperforms previous methods and that all of its components are effective.
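The abstract does not describe how the self-posture-aware strategy works internally. As a rough illustration of the general idea only, the sketch below shows how a path with frequent turns on a grid-style environment map could be reduced to short, local decisions once the agent's own heading is tracked after every step. Everything in it is an assumption made for illustration and is not taken from the paper: the grid representation, the four compass headings, the action names, and the hypothetical function `egocentric_plan`.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col) on a hypothetical grid map; row grows southward

# Compass headings listed clockwise, so +1 in this list is a right turn.
HEADINGS = ["north", "east", "south", "west"]
DELTAS = {"north": (-1, 0), "east": (0, 1), "south": (1, 0), "west": (0, -1)}


def egocentric_plan(path: List[Cell], heading: str) -> Tuple[List[str], str]:
    """Turn a global path of adjacent cells into egocentric actions.

    Illustrative assumption, not the paper's method: each step is decided
    only from the current heading and the next cell, and the heading is
    updated after every step, so no step needs long-range spatial reasoning.
    Returns the action list and the final heading.
    """
    if heading not in DELTAS:
        raise ValueError(f"unknown heading: {heading}")
    actions: List[str] = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        delta = (r1 - r0, c1 - c0)
        target = next((h for h, d in DELTAS.items() if d == delta), None)
        if target is None:
            raise ValueError(f"cells {(r0, c0)} and {(r1, c1)} are not adjacent")
        # Number of clockwise quarter turns from current heading to target.
        quarter_turns = (HEADINGS.index(target) - HEADINGS.index(heading)) % 4
        if quarter_turns == 1:
            actions.append("turn right")
        elif quarter_turns == 2:
            actions.extend(["turn right", "turn right"])
        elif quarter_turns == 3:
            actions.append("turn left")
        heading = target  # posture update: the next step reasons from here
        actions.append("move forward")
    return actions, heading


if __name__ == "__main__":
    plan, final_heading = egocentric_plan([(2, 0), (1, 0), (1, 1), (2, 1)], "north")
    for step in plan:
        print(step)
    print("final heading:", final_heading)
```

In this toy setting, a route with two turns becomes a flat sequence of "turn" and "move forward" actions, each chosen from the current posture alone. How PlanAgent encodes posture and uses it during training is not specified in the abstract.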
