Abstract
Text-to-3D scene generation is pivotal for digital content creation; however, existing methods often struggle to keep scenes globally consistent across views. We present 3DS-Gen, a modular “generate-then-reconstruct” framework that first produces a temporally coherent multi-view video prior and then reconstructs a consistent 3D scene using sparse geometry estimation and Gaussian optimization. A cascaded variational autoencoder (a 2D stage for spatial compression followed by a 3D stage for temporal compression) provides a compact, coherent latent sequence that facilitates robust reconstruction. An adaptive density threshold improves the allocation of detail in the Gaussian stage under a fixed computational budget. Explicit meshes can be extracted from the optimized representation when needed, but our claims concern multi-view consistency and reconstructability; mesh quality depends on the video prior and on the backend chosen for mesh extraction. 3DS-Gen runs on a single GPU and yields coherent scene reconstructions across diverse prompts, thereby providing a practical bridge between text and 3D content creation.
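The abstract does not spell out how the adaptive density threshold is computed, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that each Gaussian carries a scalar densification score (for example a view-space gradient norm), that densifying one Gaussian adds exactly one new Gaussian, and that the "fixed computational budget" is a cap on the total Gaussian count. The names `adaptive_density_threshold`, `select_for_densification`, `base_threshold`, and `budget` are hypothetical.

```python
import math


def adaptive_density_threshold(grad_norms, current_count, budget, base_threshold=0.0002):
    """Return a score threshold that keeps densification within the budget.

    Assumption (not from the paper): Gaussians whose score is strictly above
    the returned threshold are densified, and each one adds one new Gaussian.
    """
    remaining = max(budget - current_count, 0)
    if remaining == 0:
        # Budget exhausted: no Gaussian may be densified.
        return math.inf

    # Candidates that would be densified under the fixed base threshold.
    candidates = sorted((g for g in grad_norms if g > base_threshold), reverse=True)
    if len(candidates) <= remaining:
        # Everything fits, so the base threshold is kept.
        return base_threshold

    # Otherwise raise the threshold to the (remaining + 1)-th largest score,
    # so that at most `remaining` scores are strictly above it.
    return candidates[remaining]


def select_for_densification(grad_norms, threshold):
    """Indices of Gaussians whose score is strictly above the threshold."""
    return [i for i, g in enumerate(grad_norms) if g > threshold]


if __name__ == "__main__":
    scores = [0.1, 0.5, 0.3, 0.9, 0.05]
    t = adaptive_density_threshold(scores, current_count=5, budget=7, base_threshold=0.08)
    print(t, select_for_densification(scores, t))
```

Under these assumptions, the threshold rises only when the base threshold would exceed the budget, so the highest-scoring regions receive the available capacity first.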
Zuan Gu
Tianhan Gao
Huimin Liu
Visual Computing for Industry, Biomedicine, and Art
Northeastern University
Gu et al. (Thu,) studied this question.
www.synapsesocial.com/papers/69449a892f0218eca95084c8 — DOI: https://doi.org/10.1186/s42492-025-00210-0