What question did this study set out to answer?

The aim is to enhance vision-language models' spatial reasoning for better robotic manipulation.

May 2, 2026

A retrieval-augmented framework enabling VLM spatial awareness for object-centric robot manipulation.

Key Points

The study introduces the Retrieval-Augmented Manipulation (RAM) framework, which grounds abstract concepts in an explicit, object-centric 3D representation.
RAM was demonstrated in a zero-shot setting on a real-world robot.
RAM's core vision module was evaluated on the Common Objects in 3D (CO3D) dataset.
RAM executes complex spatial language instructions and completes spatially aware manipulation guided by a single 2D image.
RAM adaptively replans tasks by reasoning about physical constraints such as object size and collisions.
Quantitative evaluations show that RAM's core vision module generalizes to previously unseen object categories and is robust to variations in shape and occlusions.

Abstract

Connecting the semantic reasoning of vision-language models (VLMs) to the precise geometric demands of robotic manipulation remains a fundamental challenge. Although VLMs can interpret high-level commands, they lack the intrinsic spatial intelligence required for tasks demanding precise object placement, orientation, and physical reasoning. Here, we introduce Retrieval-Augmented Manipulation (RAM), an object-centric framework that endows general-purpose vision foundation models with the spatial reasoning necessary for robust manipulation. RAM bridges the semantic-to-geometric gap by grounding abstract concepts into an explicit, object-centric three-dimensional (3D) representation. This grounded information is then provided as augmented context to the VLM, empowering it to decompose complex instructions into a sequence of spatially precise and physically plausible subgoals. We demonstrate that RAM, in a zero-shot setting on a real-world robot, can execute these subgoals to fulfill complex spatial language instructions, complete spatially aware manipulation under the guidance of a single 2D image, and adaptively replan tasks by reasoning about physical constraints like object size and collisions. Quantitative evaluations on the Common Objects in 3D (CO3D) dataset also validated that RAM's core vision module generalizes to previously unseen object categories and is robust to variations in shape and occlusions. By providing a structured bridge between semantic intent and geometric execution, RAM represents a critical step toward developing more physically intelligent and general-purpose robotic systems.
