What question did this study set out to answer?

This work aims to develop a method for generating high-quality 3D scenes from single images using text and depth information.

March 5, 2026 · Open Access

Lift3Dreamer: Boosting Text-Driven Novel View Synthesis via Lifted 3D Inpainting Model from Single Images.

Key Points

Introduced a data-free training pipeline using language and diffusion models.
Fine-tuned a 2D inpainting model with pseudo-3D supervision from monocular images.
Lifted monocular images into 3D using estimated depth and simulated novel views with random camera motions.
Achieved state-of-the-art performance in rendering quality and geometric consistency.
Demonstrated strong text alignment in generated scenes.
Proved effective in both synthetic and real-world scenarios.

Abstract

• We propose Lift3Dreamer, a novel text-guided diffusion framework for controllable 3D scene generation without multi-view supervision.
• We introduce a data-free training pipeline using large language and diffusion models to synthesize high-quality RGB-D-text triplets. A 2D inpainting diffusion model is fine-tuned using pseudo-3D supervision lifted from monocular images and metric depth.
• Our method supports consistent novel view and multi-view synthesis guided by both geometry and text. Extensive experiments show state-of-the-art performance in rendering quality, geometric consistency, and text alignment.

Text-guided novel view synthesis aims to generate controllable and semantically consistent images from a single input. However, existing methods often rely on pretrained diffusion models that lack geometric awareness, resulting in artifacts and inconsistencies in occluded or unobserved regions. In this work, we present Lift3Dreamer, a novel framework that fine-tunes a 2D inpainting diffusion model with pseudo-3D supervision derived from single-view RGB-D inputs. Specifically, we estimate depth from monocular images, lift them into 3D space, and simulate novel views via random camera motions. This process produces structured visibility masks that approximate real occlusions in 3D, which are then used to supervise the inpainting model with geometry-aware guidance. To scale training without paired data, we introduce a data-free pipeline combining large language models and text-to-image generation. Equipped with these components, Lift3Dreamer shows strong performance in both synthetic and real-world scenarios, producing visually coherent and geometrically consistent results. Moreover, the framework can be extended to text-guided 3D-aware generation tasks, bridging the gap between 2D diffusion and view-consistent scene synthesis.
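The lift-and-reproject step described in the abstract (unproject pixels with depth, apply a random camera motion, render the new view, and record which pixels receive no content) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the pinhole intrinsics `K`, the function names, the motion ranges in `random_camera_motion`, and the nearest-pixel forward splatting with a z-buffer are all assumptions chosen for illustration.

```python
import numpy as np

def lift_to_3d(depth, K):
    """Unproject an (H, W) metric depth map to camera-space points, shape (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # rays with unit z-component
    return rays * depth.reshape(-1, 1)       # scale each ray by its depth

def random_camera_motion(rng, max_angle_deg=5.0, max_shift=0.1):
    """Sample a small yaw rotation and translation (hypothetical ranges)."""
    a = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    R = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
    t = rng.uniform(-max_shift, max_shift, size=3)
    return R, t

def warp_to_novel_view(rgb, depth, K, R, t):
    """Forward-splat the lifted points into a new view.

    Returns the warped image and a boolean hole mask (True = no source
    pixel landed there, i.e. a region an inpainting model would fill).
    """
    h, w = depth.shape
    pts = lift_to_3d(depth, K) @ R.T + t     # points in the new camera frame
    colors = rgb.reshape(-1, 3)
    keep = pts[:, 2] > 1e-6                  # drop points behind the camera
    pts, colors = pts[keep], colors[keep]
    proj = pts @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, colors = u[inside], v[inside], pts[inside, 2], colors[inside]
    # Z-buffer: keep only the nearest point that projects to each pixel.
    zbuf = np.full((h, w), np.inf)
    np.minimum.at(zbuf, (v, u), z)
    front = z <= zbuf[v, u]
    warped = np.zeros_like(rgb)
    warped[v[front], u[front]] = colors[front]
    hole_mask = np.isinf(zbuf)
    return warped, hole_mask
```

In a training setup along these lines, the hole mask from a randomly sampled motion would play the role of the structured visibility mask, in place of a random rectangle or brush-stroke mask. How Lift3Dreamer actually constructs and applies its masks is not specified in the abstract.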
