What question did this study set out to answer?

The aim is to develop a framework that enables world models to acquire 3D awareness from 2D observations without direct 3D labeling.

April 25, 2026 | Open Access

LiftWM: Geometric Lifting and 3D Prior Distillation for Learning Persistent 3D-Aware World Models from 2D Observations

Key Points

Goal: a framework that lets world models acquire 3D awareness from 2D observations without direct 3D labels.
Geometric Lifting Objectives (GLO) are self-supervised losses that extract implicit 3D supervision from 2D video.
3D Generative Prior Distillation uses outputs from Zero-1-to-3, MVDream, and 3D Gaussian Splatting as pseudo-3D supervisory signals.
Evaluation covers novel-view PSNR, depth RMSE, epipolar error, and a persistence score.
Relative to a 2D-only baseline, adding GLO improved novel-view PSNR from 21.84 to 24.91 dB, reduced depth RMSE from 0.184 to 0.119, and lowered epipolar error from 2.73 to 1.41 pixels.
Prior distillation further raised PSNR to 25.67 dB and lowered depth RMSE to 0.108.
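
The three metrics named in the key points have common textbook definitions, sketched below in plain Python. This is an illustrative sketch only: the report's exact evaluation protocol (value range, masking, scale alignment, how matches are chosen) is not stated here, and the function names `psnr`, `depth_rmse`, and `epipolar_error` are hypothetical, not taken from the LiftWM code.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB over two equal-length pixel sequences.

    Assumes pixel values lie in [0, max_val]; this range is an assumption.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def depth_rmse(pred, target):
    """Root-mean-square error between predicted and reference depths."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return math.sqrt(mse)


def epipolar_error(F, x1, x2):
    """Distance in pixels from point x2 to the epipolar line of x1.

    F is a 3x3 fundamental matrix (nested lists) mapping homogeneous
    points in image 1 to epipolar lines a*x + b*y + c = 0 in image 2.
    """
    u, v = x1
    a, b, c = (row[0] * u + row[1] * v + row[2] for row in F)
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * x2[0] + b * x2[1] + c) / norm
```

Under this PSNR definition, a gain of about 3 dB corresponds to roughly halving the mean squared error. For `epipolar_error`, a rectified stereo pair with `F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]` maps each point to the horizontal line through its own row, so the error reduces to the vertical offset between the two matched points.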

Abstract

LiftWM technical report / preprint.

World models trained on 2D next-frame prediction can achieve low image-level error while failing to represent view-consistent 3D scenes, which leads to weak object permanence, poor novel-view consistency, and geometric drift. This work asks how a world model can acquire 3D-aware latent structure without direct 3D labels. The proposed framework uses a persistent 3D spatial state and two complementary training signals derived from 2D observations. First, Geometric Lifting Objectives (GLO) are self-supervised losses that extract implicit 3D supervision from video using pseudo-geometry from pretrained models, including DUSt3R, MASt3R, and Depth Anything V2; they cover multi-view consistency, epipolar-constrained attention, and depth-normal coherence. Second, 3D Generative Prior Distillation uses outputs from Zero-1-to-3, MVDream, and 3D Gaussian Splatting as pseudo-3D supervisory signals. Theoretically, a 3D-awareness gap is formalized: under restricted camera support, standard 2D prediction losses do not identify a unique view-consistent 3D explanation, and auxiliary geometric information reduces this ambiguity. A variational interpretation shows that the augmented criterion is an ELBO on an expanded model with auxiliary pseudo-observations. Empirically, relative to a 2D-only baseline, adding GLO improves novel-view PSNR from 21.84 to 24.91 dB, reduces depth RMSE from 0.184 to 0.119, and lowers epipolar error from 2.73 to 1.41 pixels. Prior distillation further raises PSNR to 25.67 dB and lowers depth RMSE to 0.108. Increasing the primitive budget from 128 to 512 raises the persistence score from 0.71 to 0.86. These results demonstrate that implicit geometric supervision and distilled 3D priors can substantially narrow the gap between video prediction quality and persistent 3D scene understanding without explicit ground-truth 3D labels.

OSF archival DOI: 10.17605/OSF.IO/53FNR. OSF archival page: https://osf.io/53fnr/. Files include the technical report PDF and the LaTeX source tarball when available.
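
The abstract's variational interpretation can be illustrated with a generic bound of the kind it describes. The form below is a sketch under stated assumptions, not the report's own derivation: the symbols $x$ (2D observations), $g$ (auxiliary pseudo-observations such as pseudo-depth), $z$ (latent 3D state), $q_\phi$ and $p_\theta$ are all introduced here for illustration, and $x$ and $g$ are assumed conditionally independent given $z$.

```latex
% Assumed expanded model: p(x, g, z) = p(z) p(x | z) p(g | z)
\log p_\theta(x, g)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(g \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big)
```

Under these assumptions, the first and third terms form the usual 2D prediction objective, and the middle term plays the role of an auxiliary geometric loss. If $p_\theta(g \mid z)$ is taken to be Gaussian with fixed variance $\sigma^2$, that term reduces, up to an additive constant, to a squared error on the pseudo-observations weighted by $1/(2\sigma^2)$.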
