LiftWM technical report / preprint.

World models trained on 2D next-frame prediction can achieve low image-level error while failing to represent view-consistent 3D scenes, which leads to weak object permanence, poor novel-view consistency, and geometric drift. This work asks how a world model can acquire 3D-aware latent structure without direct 3D labels. The proposed framework combines a persistent 3D spatial state with two complementary training signals derived from 2D observations. First, Geometric Lifting Objectives (GLO) are self-supervised losses that extract implicit 3D supervision from video using pseudo-geometry from pretrained models, including DUSt3R, MASt3R, and Depth Anything V2; they cover multi-view consistency, epipolar-constrained attention, and depth-normal coherence. Second, 3D Generative Prior Distillation uses outputs from Zero-1-to-3, MVDream, and 3D Gaussian Splatting as pseudo-3D supervisory signals. On the theory side, a 3D-awareness gap is formalized: under restricted camera support, standard 2D prediction losses do not identify a unique view-consistent 3D explanation, and auxiliary geometric information reduces this ambiguity. A variational interpretation shows that the augmented criterion is an ELBO on an expanded model with auxiliary pseudo-observations. Empirically, relative to a 2D-only baseline, adding GLO improves novel-view PSNR from 21.84 to 24.91 dB, reduces depth RMSE from 0.184 to 0.119, and lowers epipolar error from 2.73 to 1.41 pixels. Prior distillation further improves PSNR to 25.67 dB and depth RMSE to 0.108. Increasing the primitive budget from 128 to 512 improves the persistence score from 0.71 to 0.86. These results demonstrate that implicit geometric supervision and distilled 3D priors can substantially narrow the gap between video prediction quality and persistent 3D scene understanding without explicit ground-truth 3D labels.

Existing OSF archival DOI: 10.17605/OSF.IO/53FNR
Existing OSF archival page: https://osf.io/53fnr/
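The three headline metrics reported here (novel-view PSNR, depth RMSE, and epipolar error in pixels) can be illustrated with a minimal Python sketch. The definitions below are the standard textbook forms and are assumptions: this abstract does not state the report's evaluation protocol (image value range, depth masking, or how the fundamental matrix is obtained), and the function names `psnr`, `depth_rmse`, and `epipolar_error` are hypothetical.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB, assuming images scaled to [0, max_val]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def depth_rmse(pred_depth, ref_depth, valid=None):
    """Root-mean-square depth error over valid pixels.

    Assumption: pixels with non-finite or non-positive reference depth are ignored
    unless an explicit boolean mask is passed.
    """
    pred = np.asarray(pred_depth, dtype=np.float64)
    ref = np.asarray(ref_depth, dtype=np.float64)
    if valid is None:
        valid = np.isfinite(ref) & (ref > 0)
    diff = pred[valid] - ref[valid]
    return float(np.sqrt(np.mean(diff ** 2)))


def epipolar_error(pts1, pts2, F):
    """Mean symmetric point-to-epipolar-line distance in pixels.

    pts1, pts2: (N, 2) matched pixel coordinates in views 1 and 2.
    F: 3x3 fundamental matrix with the convention x2^T F x1 = 0.
    """
    pts1 = np.asarray(pts1, dtype=np.float64)
    pts2 = np.asarray(pts2, dtype=np.float64)
    F = np.asarray(F, dtype=np.float64)
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])  # homogeneous points in view 1
    x2 = np.hstack([pts2, ones])  # homogeneous points in view 2
    l2 = x1 @ F.T                 # epipolar lines in view 2 (F x1)
    l1 = x2 @ F                   # epipolar lines in view 1 (F^T x2)
    residual = np.abs(np.sum(x2 * l2, axis=1))  # |x2^T F x1|
    d2 = residual / np.hypot(l2[:, 0], l2[:, 1])
    d1 = residual / np.hypot(l1[:, 0], l1[:, 1])
    return float(np.mean(0.5 * (d1 + d2)))


if __name__ == "__main__":
    # Toy check: a pure horizontal camera shift, so epipolar lines are image rows.
    F = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
    pts1 = np.array([[10.0, 20.0], [30.0, 40.0]])
    pts2 = np.array([[15.0, 20.0], [33.0, 42.0]])
    print(epipolar_error(pts1, pts2, F))
```

In the toy check, the first match lies on its epipolar line and the second is two rows off, so the mean error is the average of those two vertical offsets. For a pseudo-geometry setting like the one described here, `F` would come from estimated rather than ground-truth camera poses, which is another assumption of this sketch.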
Files include the technical report PDF and the LaTeX source tarball when available.
Haopeng Jin studied this question.