• We propose Lift3Dreamer, a novel text-guided diffusion framework for controllable 3D scene generation without multi-view supervision.
• We introduce a data-free training pipeline using large language and diffusion models to synthesize high-quality RGB-D-text triplets. A 2D inpainting diffusion model is fine-tuned using pseudo-3D supervision lifted from monocular images and metric depth.
• Our method supports consistent novel view and multi-view synthesis guided by both geometry and text. Extensive experiments show state-of-the-art performance in rendering quality, geometric consistency, and text alignment.

Text-guided novel view synthesis aims to generate controllable and semantically consistent images from a single input. However, existing methods often rely on pretrained diffusion models that lack geometric awareness, resulting in artifacts and inconsistencies in occluded or unobserved regions. In this work, we present Lift3Dreamer, a novel framework that fine-tunes a 2D inpainting diffusion model with pseudo-3D supervision derived from single-view RGB-D inputs. Specifically, we estimate depth from monocular images, lift them into 3D space, and simulate novel views via random camera motions. This process produces structured visibility masks that approximate real occlusions in 3D, which are then used to supervise the inpainting model with geometry-aware guidance. To scale training without paired data, we introduce a data-free pipeline combining large language models and text-to-image generation. Equipped with these components, Lift3Dreamer shows strong performance in both synthetic and real-world scenarios, producing visually coherent and geometrically consistent results. Moreover, the framework can be extended to text-guided 3D-aware generation tasks, bridging the gap between 2D diffusion and view-consistent scene synthesis.
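To make the lifting and reprojection step concrete, the following is a minimal NumPy sketch of how a visibility mask could be obtained from a single depth map. It is an illustration under stated assumptions, not the authors' implementation: the function names (`unproject`, `random_camera_motion`, `visibility_mask`), the pinhole intrinsics `K`, the convention that a source point X maps to R X + t in the target camera frame, and the motion ranges are all hypothetical choices made for this example.

```python
import numpy as np


def unproject(depth, K):
    """Lift an (H, W) metric depth map to (H*W, 3) points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # back-projected rays with z = 1
    return rays * depth.reshape(-1, 1)       # scale each ray by its depth


def random_camera_motion(rng, max_angle_deg=10.0, max_shift=0.2):
    """Sample a small yaw rotation and translation (assumed ranges, for illustration)."""
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    t = rng.uniform(-max_shift, max_shift, size=3)
    return R, t


def visibility_mask(depth, K, R, t):
    """Forward-warp source pixels into the novel view.

    Returns an (H, W) boolean mask that is True where at least one lifted
    point lands; False pixels are holes an inpainting model would fill.
    """
    h, w = depth.shape
    pts = unproject(depth, K) @ R.T + t      # points in the target camera frame
    pts = pts[pts[:, 2] > 1e-6]              # keep points in front of the camera
    proj = pts @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    mask = np.zeros((h, w), dtype=bool)
    mask[v[inside], u[inside]] = True
    return mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = np.array([[50.0, 0.0, 32.0], [0.0, 50.0, 32.0], [0.0, 0.0, 1.0]])
    depth = rng.uniform(1.0, 4.0, size=(64, 64))   # stand-in for an estimated depth map
    R, t = random_camera_motion(rng)
    mask = visibility_mask(depth, K, R, t)
    print("fraction of target pixels covered:", mask.mean())
```

Because only coverage is needed for the mask, no depth ordering is computed here; warping the RGB values themselves would additionally need a z-buffer so that nearer points overwrite farther ones.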
Lin et al. (Sun,) studied this question.