What question did this study set out to answer?

This work aims to tackle the anchoring deficit in AI systems by establishing a Real-World Foundation Model that integrates physical intuition with cognitive processes.

June 11, 2026 | Open Access

From Anchoring Deficit to Embodied Anchoring: Native Understanding-Driven Real-World Foundation Model

Key Points

Introduces a three-layer framework (Perception, Cognition, and Execution), with each layer addressing a different aspect of real-world interaction.
Uses a Spatiotemporal Physics Transformer as the Cognition Layer backbone, learning physical understanding from multimodal event streams with structured physical inductive biases.
Proposes a dual-engine collaboration system that pairs a real-world substrate with a virtual AI foundation model and divides their decision-making boundaries by domain jurisdiction.
Expects the framework to improve AI's grasp of physical attributes and context, mitigating hallucinations that stem from grounding knowledge only in symbolic data.
Positions the real-world anchoring module first as a callable anchoring plugin, with dual-engine operation following in the medium term.
Decomposes the core claims into falsifiable hypotheses, all of which await experimental verification in future work.

Abstract

Current AI development faces a dual fracture. On the symbolic level, AI systems represented by virtual AI foundation models (such as large language models) ground their knowledge in the statistical correlations of symbolic data rather than in direct interaction with the real world. This foundational anchoring deficit makes the hallucination problem incurable at its root. On the geometric level, machine perception systems prioritize dense three-dimensional reconstruction, reducing the real world to a spatial occupancy problem and systematically neglecting essential attributes such as material properties, affordance, and temporal processes. These two dilemmas share a single root: the anchoring deficit between the symbolic and the real world.

This paper proposes a unified "Real-World Foundation Model" framework composed of three mutually supporting threads.

The Perception Layer takes passive multimodal event streams as its perceptual primitives. It eliminates spatial ambiguity through a centralized rigid multi-view arrangement with geometric verification, tolerates inter-modal physical delays via a multimodal parallel witness state machine, and outputs complete, scene-wide unified event streams, providing the system with a real-world truth anchor that is independent of text.

The Cognition Layer uses a Spatiotemporal Physics Transformer as its backbone architecture. It grows native physical intuition and a world model directly from multimodal event streams through factorized spatiotemporal attention, structured injection of physical inductive bias, four core self-supervised pre-training tasks, common-sense distillation guided by large language models (LLMs), and an external explicit physical memory bank.

The Execution Layer uses a distributed heterogeneous body network as its perception-execution terminals. It takes "whether the result is achieved" as the sole learning signal, conducts internal simulated operation on the world model, and accumulates shareable, transferable, and referenceable operational knowledge.

The three threads form a complete logical loop: the Perception Layer provides the real-world truth anchor, the Cognition Layer grows learnable and generalizable real-world representations upon this anchor, and the Execution Layer conducts operation simulation and accumulates operational knowledge upon these representations.

The relationship between virtual AI foundation models and the real-world substrate is positioned as co-evolution. In the near term, the real-world anchoring module can serve as an anchoring plugin invoked by the system's main brain to mitigate hallucinations. In the medium term, the virtual AI foundation model and the real-world substrate can operate in parallel as a dual-engine collaboration system. In the long term, each can specialize: the real-world substrate excels at physical intuition and world simulation, while the virtual AI foundation model excels at linguistic interaction and abstract reasoning, and together they form a complete cognitive ecology.

The dual engines divide decision-making boundaries by domain jurisdiction. Questions about the real world are ultimately adjudicated by the real-world substrate, and questions of linguistic interaction and planning are ultimately adjudicated by the virtual AI foundation model. A non-specialist engine may offer suggestions to the specialist but has no final decision-making authority.

Statement of Nature: This paper is not a technical report of a completed system implementation, but a "System Challenge Paper". Its contribution lies in identifying and defining a problem domain systematically neglected by mainstream paradigms, proposing a logically self-consistent reference architecture, and decomposing its core claims into a set of falsifiable hypotheses (see Appendix A). All core claims await experimental verification. I openly acknowledge the numerous unresolved challenges in engineering implementation, the phased compromises of the near-term pragmatic path, and the deep openness of the relationship between real-world and virtual cognitive engines. The most pragmatic entry point at present is to deploy the real-world anchoring module as a callable anchoring plugin within the virtual AI foundation model ecosystem, verifying the core cognitive claims at minimal engineering cost.
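The domain-jurisdiction rule described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Domain`, `Engine`, `Verdict`, and `adjudicate` are hypothetical, the two engines are stand-in stubs, and the sketch assumes the domain of each question is already known, since the abstract does not say how a question is classified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Domain(Enum):
    REAL_WORLD = "real_world"   # physical intuition, world simulation
    LINGUISTIC = "linguistic"   # linguistic interaction, planning


@dataclass
class Engine:
    # Hypothetical wrapper: an engine has a name, a home domain,
    # and a function that maps a question to an answer.
    name: str
    domain: Domain
    answer: Callable[[str], str]


@dataclass
class Verdict:
    answer: str       # final answer, always taken from the specialist
    decided_by: str   # name of the engine that held jurisdiction
    suggestion: str   # non-binding input from the non-specialist engine


def adjudicate(question: str, domain: Domain,
               substrate: Engine, foundation_model: Engine) -> Verdict:
    """Give the final say to the engine whose domain matches the question.

    The other engine is still consulted, but its output is recorded only
    as a suggestion and never overrides the specialist.
    """
    engines = (substrate, foundation_model)
    specialist = next(e for e in engines if e.domain is domain)
    advisor = next(e for e in engines if e.domain is not domain)
    return Verdict(
        answer=specialist.answer(question),
        decided_by=specialist.name,
        suggestion=advisor.answer(question),
    )


# Stub engines for illustration only (assumed, not from the paper).
substrate = Engine("real-world substrate", Domain.REAL_WORLD,
                   lambda q: "substrate: " + q)
language_model = Engine("virtual AI foundation model", Domain.LINGUISTIC,
                        lambda q: "model: " + q)

verdict = adjudicate("Will the glass tip over?", Domain.REAL_WORLD,
                     substrate, language_model)
print(verdict.decided_by)  # prints "real-world substrate"
```

Keeping the advisor's output in a separate `suggestion` field is one way to reflect the stated rule that a non-specialist may advise without holding final authority. How the specialist would actually weigh that suggestion is left open here, as it is in the abstract.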
