What question did this study set out to answer?

This research aims to enhance the generation of 3D scenes from textual descriptions with improved consistency.

December 19, 2025Open Access

Text-to-3D scene generation framework: bridging textual descriptions to high-fidelity 3D scenes

Key Points

This research aims to enhance the generation of 3D scenes from textual descriptions with improved consistency.
Introduced a modular framework named 3DS-Gen
Employed a generate-then-reconstruct approach
Utilized sparse geometry estimation and Gaussian optimization for reconstruction
3DS-Gen yields coherent scene reconstructions across diverse prompts
Improvement in multiview consistency
Mesh quality is dependent on the video prior and explicitification backend

Abstract

Abstract Text-to-3D scene generation is pivotal for digital content creation; however, existing methods often struggle with global consistency across views. We present 3DS-Gen, a modular “generate-then-reconstruct” framework that first produces a temporally coherent multi-view video prior and then reconstructs consistent 3D scenes using sparse geometry estimation and Gaussian optimization. A cascaded variational autoencoder (2D for spatial compression and 3D for temporal compression) provides a compact and coherent latent sequence that facilitates robust reconstruction. An adaptive density threshold improves detailed allocation in the Gaussian stage under a fixed computational budget. While explicit meshes can be extracted from the optimized representation when needed, our claims emphasize multiview consistency and reconstructability; the mesh quality depends on the video prior and the chosen explicitification backend. 3DS-Gen runs on a single GPU and yields coherent scene reconstructions across diverse prompts, thereby providing a practical bridge between text and 3D content creation.

Text-to-3D scene generation framework: bridging textual descriptions to high-fidelity 3D scenes

Key Points

Abstract

Cite This Study