Connecting the semantic reasoning of vision-language models (VLMs) to the precise geometric demands of robotic manipulation remains a fundamental challenge. Although VLMs can interpret high-level commands, they lack the intrinsic spatial intelligence required for tasks demanding precise object placement, orientation, and physical reasoning. Here, we introduce Retrieval-Augmented Manipulation (RAM), an object-centric framework that endows general-purpose vision foundation models with the spatial reasoning necessary for robust manipulation. RAM bridges the semantic-to-geometric gap by grounding abstract concepts in an explicit, object-centric three-dimensional (3D) representation. This grounded information is then provided as augmented context to the VLM, enabling it to decompose complex instructions into a sequence of spatially precise and physically plausible subgoals. We demonstrate that RAM, in a zero-shot setting on a real-world robot, can execute these subgoals to fulfill complex spatial language instructions, perform spatially aware manipulation guided by a single 2D image, and adaptively replan tasks by reasoning about physical constraints such as object size and collisions. Quantitative evaluations on the Common Objects in 3D (CO3D) dataset further show that RAM's core vision module generalizes to previously unseen object categories and is robust to shape variation and occlusion. By providing a structured bridge between semantic intent and geometric execution, RAM represents a critical step toward more physically intelligent and general-purpose robotic systems.
Chen et al. studied this question.