What question did this study set out to answer?

This research aims to develop a novel hierarchical predictive architecture for kinematic planning without generative reconstruction.

June 2, 2026 · Open Access

Micro-World Models: Energy-Based Hierarchical Joint-Embedding Predictive Architectures for Continuous Kinematic Planning

Key Points

H-JEPA is a hierarchical predictive architecture for kinematic planning that does not reconstruct the observations it predicts.
Context and target encoders are trained jointly, with no EMA target encoder.
Two depths of a one-dimensional Vision Transformer feed predictors with different horizons: short-horizon kinematics and a single macro-event 25 frames ahead.
A Cross-Entropy Method (CEM) planner searches action sequences in the 256-dimensional latent space.
The composite training objective falls from about 121 to about 40 over 300 epochs.
Variance and covariance diagnostics indicate a non-degenerate, decorrelated representation.
A linear probe recovers entity positions from the shallow latent.
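
The Cross-Entropy Method search listed in the key points can be illustrated with a minimal NumPy sketch. It is not the report's reference implementation: the linear `toy_predict` dynamics, the matrix `B`, the 4-dimensional latent, and every hyperparameter (`horizon`, `pop`, `n_elite`, `iters`) are assumptions chosen for illustration. The energy is assumed to be the squared distance between the predicted final latent and the goal embedding.

```python
import numpy as np

def cem_plan(predict, z0, z_goal, horizon=5, action_dim=2,
             pop=256, n_elite=32, iters=10, seed=0):
    """Search an action sequence whose latent rollout ends near z_goal."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        actions = mean + std * rng.standard_normal((pop, horizon, action_dim))
        # Roll every candidate forward in latent space with the frozen predictor.
        z = np.repeat(z0[None, :], pop, axis=0)
        for t in range(horizon):
            z = predict(z, actions[:, t])
        # Energy (assumed): squared distance to the goal embedding. No reward, no decoding.
        energy = np.sum((z - z_goal) ** 2, axis=1)
        # Refit the Gaussian to the lowest-energy candidates.
        elite = actions[np.argsort(energy)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

def rollout(predict, z0, actions):
    """Apply one action sequence and return the final latent."""
    z = z0[None, :]
    for a in actions:
        z = predict(z, a[None, :])
    return z[0]

# Hypothetical stand-in for the frozen high-level predictor: linear latent dynamics.
B = np.array([[1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.5]])

def toy_predict(z, a):
    return z + a @ B
```

Calling `cem_plan(toy_predict, z0, z_goal)` returns the mean action sequence of the final sampling distribution. In a real setting `toy_predict` would be replaced by the trained, frozen predictor, and the goal embedding would come from encoding a goal observation.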

Abstract

H-JEPA is a hierarchical Joint-Embedding Predictive Architecture that learns a world model and plans entirely in representation space, without reconstructing the observation it predicts. It departs from the I-JEPA and V-JEPA family in three ways:

No EMA target. The frozen Exponential Moving Average target encoder is removed. Context and target encoders are trained jointly, and representation collapse is prevented by an explicit Variance-Invariance-Covariance Regularization (VICReg) objective rather than by architectural asymmetry.
Multi-scale temporal prediction. Two depths of a one-dimensional Vision Transformer feed two predictors: a shallow tap for short-horizon kinematics and a deeper tap that forecasts a single macro-event 25 frames ahead under temporal dilation.
Action-conditioned forward dynamics. Both predictors are conditioned on applied forces, so the model answers counterfactual questions and can be used for planning.

After training, the high-level predictor is frozen and a Cross-Entropy Method (CEM) planner searches action sequences in the 256-dimensional latent space to reach a goal embedding, with no reward and no decoding.

The model is instantiated on a three-body bounded kinematic simulator with mass-dependent elastic collisions and trained for 300 epochs on roughly 49,000 sequences using the Apple Silicon unified-memory (MPS) backend. The composite objective falls from about 121 to about 40; the variance and covariance diagnostics indicate a non-degenerate, decorrelated representation; and a linear probe recovers entity positions from the shallow latent.

This 13-page technical report gives the full architecture, the training behaviour, the diagnostic protocol, and an appendix with the complete reference implementation (simulator, encoders, predictors, VICReg loss, CEM planner, and training loop).
It is an open, self-contained study of how causal physical structure and latent planning can be approached without generative reconstruction and without a bootstrapped target.

Topics: world models, JEPA, energy-based models, VICReg, self-supervised representation learning, representation collapse, model-based planning, cross-entropy method, Vision Transformer, latent dynamics, physical reasoning.

Cite as: R. A. Patil, "Micro-World Models: Energy-Based Hierarchical Joint-Embedding Predictive Architectures for Continuous Kinematic Planning," Zenodo, May 2026, doi: 10.5281/zenodo.20403374.
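
The abstract names VICReg as the mechanism that prevents representation collapse once the EMA target is removed. As a rough illustration of what such an objective computes, here is a minimal NumPy sketch of a VICReg-style loss on two batches of embeddings. The function name, the weights `sim_w`, `var_w`, `cov_w`, the margin `gamma`, and `eps` are assumptions (the weights follow values commonly used with VICReg); the report's own settings and implementation may differ.

```python
import numpy as np

def vicreg_loss(z_pred, z_target, sim_w=25.0, var_w=25.0, cov_w=1.0,
                gamma=1.0, eps=1e-4):
    """VICReg-style loss on two (batch, dim) embedding matrices."""
    n, d = z_pred.shape

    # Invariance: mean squared error between predicted and target embeddings.
    invariance = np.mean((z_pred - z_target) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation: penalise std below gamma.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def covariance_term(z):
        # Sum of squared off-diagonal covariance entries, scaled by dimension.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_pred) + variance_term(z_target)
    covariance = covariance_term(z_pred) + covariance_term(z_target)
    total = sim_w * invariance + var_w * variance + cov_w * covariance
    parts = {"invariance": invariance, "variance": variance, "covariance": covariance}
    return total, parts
```

The returned `parts` dictionary doubles as a diagnostic: a collapsed encoder that maps every input to the same vector drives the variance term to its maximum, while strongly correlated dimensions raise the covariance term. This is one way the "non-degenerate, decorrelated" claim could be monitored during training.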

Authors

Rishabh Ashok Patil
