Abstract Text-to-3D scene generation is pivotal for digital content creation; however, existing methods often struggle with global consistency across views. We present 3DS-Gen, a modular “generate-then-reconstruct” framework that first produces a temporally coherent multi-view video prior and then reconstructs consistent 3D scenes using sparse geometry estimation and Gaussian optimization. A cascaded variational autoencoder (2D for spatial compression and 3D for temporal compression) provides a compact and coherent latent sequence that facilitates robust reconstruction. An adaptive density threshold improves detailed allocation in the Gaussian stage under a fixed computational budget. While explicit meshes can be extracted from the optimized representation when needed, our claims emphasize multiview consistency and reconstructability; the mesh quality depends on the video prior and the chosen explicitification backend. 3DS-Gen runs on a single GPU and yields coherent scene reconstructions across diverse prompts, thereby providing a practical bridge between text and 3D content creation.
Gu et al. (Thu,) studied this question.