World models, internal predictive representations that enable agents to simulate future states, anticipate consequences, and plan actions, have emerged as a foundational paradigm in embodied artificial intelligence. Originating in model-based reinforcement learning, the field has been transformed by the advent of large-scale generative models, which blur the historical boundary between passive video prediction and interactive physical simulation. Concurrently, Vision-Language-Action (VLA) models have established a powerful framework for grounding high-level linguistic intent in low-level motor control. The convergence of these two threads (predictive world simulation and action-grounded multimodal reasoning) has given rise to Embodied World Action Models (WAMs), a new frontier in which agents learn to act by imagining their futures. However, the rapid growth of methods across robotics, autonomous driving, and interactive simulation has produced a fragmented landscape that lacks systematic unification. This survey presents a comprehensive, structured review of the modern world model ecosystem, covering more than 200 key papers organized into a unified taxonomy.
We systematically cover six major pillars: (i) Foundation World Models, including general-purpose interactive simulators (Genie, Cosmos, Sora) and game-specific environments (Oasis, Matrix-Game); (ii) Vision-Language-Action Models, spanning foundational architectures (RT-2, π₀, OpenVLA), driving-specific VLAs, and embodied manipulation policies; (iii) Embodied World Action Models, unifying video generation and action prediction through zero-shot policies, controllable simulation platforms, and world model-based reinforcement learning; (iv) Autonomous Driving World Models, addressing video generation, closed-loop simulation, planning policies, and geometric occupancy/BEV representations; (v) Efficiency and Evaluation, covering computational acceleration techniques and benchmarking protocols for physical plausibility; and (vi) Datasets and Ecosystems, including large-scale robot learning corpora and industry technical reports that underpin the entire field. Through this organization, we illuminate the evolutionary trajectory from passive pixel predictors to active, reasoning, and action-grounded simulators. We identify critical open challenges—including physical consistency, cross-embodiment generalization, safety verification, and the sim-to-real evaluation gap—and outline future directions toward cognitive world models, autonomous data collection, and standardized open ecosystems. This survey aims to serve as a definitive reference for researchers and practitioners advancing the next generation of embodied intelligence.
Xin Jin (Wed,) studied this question.
www.synapsesocial.com/papers/6a05685ca550a87e60a20ede — DOI: https://doi.org/10.5281/zenodo.20130369