H-JEPA is a hierarchical Joint-Embedding Predictive Architecture that learns a world model and plans entirely in representation space, without reconstructing the observations it predicts. It departs from the I-JEPA and V-JEPA family in three ways:

No EMA target. The gradient-frozen Exponential Moving Average (EMA) target encoder is removed. Context and target encoders are trained jointly, and representation collapse is prevented by an explicit Variance-Invariance-Covariance Regularization (VICReg) objective rather than by architectural asymmetry.

Multi-scale temporal prediction. Two depths of a one-dimensional Vision Transformer feed two predictors: a shallow tap for short-horizon kinematics and a deeper tap that forecasts a single macro-event 25 frames ahead under temporal dilation.

Action-conditioned forward dynamics. Both predictors are conditioned on applied forces, so the model can answer counterfactual questions and be used for planning.

After training, the high-level predictor is frozen and a Cross-Entropy Method (CEM) planner searches over action sequences in the 256-dimensional latent space to reach a goal embedding, with no reward signal and no decoding.

The model is instantiated on a three-body bounded kinematic simulator with mass-dependent elastic collisions and trained for 300 epochs on roughly 49,000 sequences using the Metal Performance Shaders (MPS) backend on Apple Silicon unified memory. The composite objective falls from about 121 to about 40; the variance and covariance diagnostics indicate a non-degenerate, decorrelated representation; and a linear probe recovers entity positions from the shallow latent.

This 13-page technical report gives the full architecture, the training behaviour, the diagnostic protocol, and an appendix with the complete reference implementation (simulator, encoders, predictors, VICReg loss, CEM planner, and training loop).
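To make the collapse-prevention idea concrete, the following is a minimal NumPy sketch of a VICReg-style objective. It is not the report's reference implementation: the function name `vicreg_loss`, the weights (25, 25, 1, the defaults from the original VICReg paper), and the `eps` value are assumptions made for illustration. The invariance term pulls predicted and target embeddings together, the variance term is a hinge that penalises any embedding dimension whose standard deviation falls below 1, and the covariance term penalises off-diagonal covariance so that dimensions stay decorrelated.

```python
import numpy as np


def vicreg_loss(z_pred, z_target, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss on two (batch, dim) embedding arrays.

    The weights and eps are illustrative assumptions, not values from the report.
    Returns the weighted total and a dict with the three unweighted terms.
    """
    n, d = z_pred.shape

    # Invariance: mean squared error between predicted and target embeddings.
    invariance = np.mean((z_pred - z_target) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation; zero once std >= 1.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance_term(z):
        # Sum of squared off-diagonal covariance entries, scaled by dimension.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_pred) + variance_term(z_target)
    covariance = covariance_term(z_pred) + covariance_term(z_target)
    total = sim_w * invariance + var_w * variance + cov_w * covariance
    terms = {"invariance": invariance, "variance": variance, "covariance": covariance}
    return total, terms
```

Under these assumptions, a collapsed batch (every row identical) gives zero invariance and covariance terms but a variance term close to its maximum, which is the signal a jointly trained encoder pair needs in the absence of an EMA target. Logging the returned `terms` separately is one way to obtain variance and covariance diagnostics of the kind the abstract mentions.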
The report is an open, self-contained study of how causal physical structure and latent planning can be approached without generative reconstruction and without a bootstrapped target.

Topics: world models, JEPA, energy-based models, VICReg, self-supervised representation learning, representation collapse, model-based planning, cross-entropy method, Vision Transformer, latent dynamics, physical reasoning.

Cite as: R. A. Patil, "Micro-World Models: Energy-Based Hierarchical Joint-Embedding Predictive Architectures for Continuous Kinematic Planning," Zenodo, May 2026, doi: 10.5281/zenodo.20403374.
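The latent planning step described in the abstract (a CEM search over action sequences, scored by distance to a goal embedding through a frozen predictor) can be sketched roughly as follows. Everything here is a hypothetical stand-in: `toy_predictor` replaces the frozen high-level predictor with simple additive dynamics, the latent is 2-dimensional rather than 256-dimensional, and the population size, elite count, horizon, and iteration count are illustrative choices, not the report's settings.

```python
import numpy as np


def toy_predictor(z, a):
    """Hypothetical stand-in for a frozen latent predictor: z_next = z + a."""
    return z + a


def cem_plan(predict, z0, z_goal, horizon=5, action_dim=2,
             pop=128, n_elite=16, iters=15, seed=0):
    """Cross-Entropy Method over action sequences, scored purely in latent space.

    predict maps (pop, latent_dim) latents and (pop, action_dim) actions to next latents.
    Returns the final mean action sequence with shape (horizon, action_dim).
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))

    for _ in range(iters):
        # Sample a population of candidate action sequences.
        noise = rng.standard_normal((pop, horizon, action_dim))
        actions = mean + std * noise

        # Roll every candidate forward through the predictor in latent space.
        z = np.repeat(z0[None, :], pop, axis=0)
        for t in range(horizon):
            z = predict(z, actions[:, t])

        # Cost is squared distance to the goal embedding: no reward, no decoding.
        costs = np.sum((z - z_goal) ** 2, axis=1)

        # Refit the sampling distribution to the lowest-cost candidates.
        elite = actions[np.argsort(costs)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6

    return mean
```

A call such as `cem_plan(toy_predictor, np.zeros(2), np.array([1.0, -2.0]))` returns a 5-step action sequence whose rollout ends near the goal under the toy dynamics. In a real setting, `predict` would wrap the trained predictor and `z_goal` would come from encoding a goal observation.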
Rishabh Ashok Patil
Rishabh Ashok Patil (Sun,) studied this question.
synapsesocial.com/papers/6a1e72cb30b38c64201b5fa5 — DOI: https://doi.org/10.5281/zenodo.20480620