Embodied navigation and manipulation are fundamental capabilities for embodied agents operating in physical environments. A key challenge is understanding the spatial context and the affordances of the environment, which involves recognizing how objects can be interacted with (object affordance) and identifying suitable locations for movement and object placement (free space affordance). While Vision-Language Models (VLMs) have shown promise in high-level task planning, their ability to translate reasoning into precise executable actions remains limited, particularly in image-based spatial understanding and precise affordance localization, a critical gap in image processing for robotics. To bridge this gap, we propose EspA, a novel image-to-keypoint model that leverages spatial-aware affordance learning to predict actionable affordances directly from 2D image inputs. Built on a hierarchical vision-language architecture, EspA jointly reasons about object affordances and free space affordances, enabling pixel-level localization of both types of interaction. Crucially, EspA translates language instructions into precise 2D affordance keypoints in the observed images, which are then projected into 3D actionable coordinates using depth information. To support this unified affordance reasoning, we introduce the Embodied Spatial Affordance (ESA) dataset, which captures both object-centric interactions and free space contexts. By jointly modeling these affordances in a shared representation space, EspA overcomes the limitations of prior work that treats them independently. The dataset's fine-grained annotations enable our model to learn the intricate relationship between object functionality and spatial feasibility, significantly enhancing spatial understanding in embodied tasks.
Extensive experimental results demonstrate that EspA outperforms existing state-of-the-art VLMs, both open-source and closed-source, in object and free space affordance prediction. Furthermore, it exhibits superior performance in real-world embodied navigation and manipulation experiments. Our work advances image-based spatial reasoning by providing a scalable solution for translating high-level instructions into low-level actionable affordances. We believe this work paves the way for more robust and versatile embodied agents capable of interacting effectively with complex environments. The dataset, benchmark, and evaluation code will be made publicly available to facilitate future research. Project website: https://embodied-spatial-affordance.github.io/.
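The abstract states that predicted 2D affordance keypoints are projected into 3D coordinates using depth information, but does not say how. The following is a minimal sketch of one common way to do this, assuming a standard pinhole camera model with known intrinsics and a depth map aligned to the RGB image. The function names, parameters, and example intrinsics are hypothetical illustrations and are not taken from the paper.

```python
import numpy as np


def backproject_keypoint(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera-frame XYZ.

    Assumes a pinhole camera: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. All names are illustrative.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m], dtype=float)


def keypoint_to_3d(keypoint_uv, depth_map, fx, fy, cx, cy):
    """Look up depth at an integer 2D keypoint and back-project it.

    depth_map is assumed to be an (H, W) array in meters, aligned with
    the image the keypoint was predicted on. Returns None when the
    depth reading is missing (zero or non-finite).
    """
    u, v = keypoint_uv
    depth_m = float(depth_map[v, u])  # row index is v, column index is u
    if not np.isfinite(depth_m) or depth_m <= 0.0:
        return None
    return backproject_keypoint(u, v, depth_m, fx, fy, cx, cy)


# Hypothetical example: a 480x640 depth map with a constant depth of 2 m.
depth = np.full((480, 640), 2.0)
point = keypoint_to_3d((620, 240), depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

In practice, depth sensors often return holes at the exact keypoint pixel, so a real pipeline might take a median over a small window around the keypoint rather than a single reading; that choice is likewise an assumption here.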
Hao et al. (Thu,) studied this question.