Direct 3D scene stylization from sparse views remains a significant challenge: existing optimization-based methods are prohibitively slow and require dense inputs to prevent geometric corruption. While recent direct methods accelerate this process, their rigid decoupling of static geometry from appearance often leads to visual artifacts, where stylistic textures conflict with and distort the underlying scene structure. To address these limitations, we introduce GeoStyler, a direct framework that generates high-fidelity, multi-view consistent stylized 3D scenes in seconds. Our approach reformulates the conventional pipeline by first leveraging a diffusion model to generate a set of geometrically consistent stylized 2D images. The core of this stage is a novel hybrid query formulation for the self-attention mechanism. Specifically, cross-view geometric information is directly embedded into the query to enforce 3D consistency, while style information is independently injected via the key and value to preserve scene structure. This process is further stabilized by a geometry-aware latent initialization that provides a coherent starting point for the denoising process. Subsequently, a decoupled reconstruction network lifts these 2D stylized images to 3D Gaussians. A geometry branch predicts a robust 3D scaffold from the original content images, while a parallel style branch predicts the final appearance from our generated stylized images, ensuring that structural integrity is not compromised. Extensive experiments on large-scale benchmarks, including RealEstate10K and ACID, demonstrate that GeoStyler significantly outperforms prior art in stylization quality and multi-view consistency, achieving state-of-the-art performance with a dramatic speedup. Our project page: https://huhuhuxiao.github.io/Geo-Styler/.
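To make the hybrid query formulation concrete, the following is a minimal single-head sketch, not the GeoStyler implementation. It assumes that the geometric embedding is simply added to the content query, and that keys and values are taken directly from style features; the abstract states only that geometry enters through the query and style through the key and value. The names `hybrid_query_attention`, `geo_embed`, and `geo_weight` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_query_attention(content_q, geo_embed, style_k, style_v, geo_weight=1.0):
    """Scaled dot-product attention with a geometry-augmented query.

    content_q, geo_embed: lists of query vectors, one per content token.
    style_k, style_v:     lists of key and value vectors from style features.
    The additive combination below is an assumption made for illustration.
    """
    dim = len(style_k[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q_c, q_g in zip(content_q, geo_embed):
        # Query carries content plus cross-view geometric information.
        q = [c + geo_weight * g for c, g in zip(q_c, q_g)]
        # Attention scores are computed against style keys only.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in style_k]
        weights = softmax(scores)
        # Output is a convex combination of style values.
        out = [
            sum(w * v[j] for w, v in zip(weights, style_v))
            for j in range(len(style_v[0]))
        ]
        outputs.append(out)
    return outputs
```

In this sketch the style tokens never alter the query, so where each output token attends from is determined by content and geometry alone, while what it receives comes only from the style values.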
Hu et al. (Thu,) studied this question.