Abstract

Reconstructing detailed geometry and realistic appearance from a single RGB image is essential yet fundamentally challenging because of inherent ambiguities such as occlusion, lighting variation, and texture-geometry entanglement. Recent diffusion-based generative models have significantly improved novel view synthesis, but existing approaches suffer from two critical limitations: a lack of cross-view geometric consistency and insufficient cross-domain semantic alignment. To address these issues, we introduce UniCross3D, a unified cross-view and cross-domain diffusion framework designed explicitly for consistent and physically coherent 3D generation. UniCross3D makes two novel contributions: (1) a cross-view latent regularization that enforces geometric consistency across synthesized viewpoints by penalizing latent variance, and (2) a cross-domain mutual information objective, grounded in the physics of image formation, that explicitly aligns synthesized color and normal maps. Extensive experiments demonstrate that UniCross3D achieves significantly better view consistency and semantic alignment than state-of-the-art methods and yields higher-fidelity reconstructions, particularly under challenging textures and ambiguous viewpoints.
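The abstract describes the cross-view regularizer only as "penalizing latent variance" and gives no formula. As a rough illustration of what such a penalty could look like, the sketch below computes the per-dimension variance of latent vectors across views and averages it. The function name `cross_view_variance_penalty`, the flat-vector latent layout, and the use of population variance are all assumptions made for illustration, not the paper's actual loss.

```python
from statistics import fmean


def cross_view_variance_penalty(latents):
    """Mean per-dimension variance of latents across views.

    Hypothetical sketch: `latents` is a list of V views, each a list of
    D floats. The actual UniCross3D loss is not specified in the abstract.
    """
    if not latents or not latents[0]:
        raise ValueError("need at least one non-empty latent vector")
    dim = len(latents[0])
    if any(len(z) != dim for z in latents):
        raise ValueError("all views must share the same latent dimension")

    total = 0.0
    for d in range(dim):
        # Collect the d-th latent coordinate from every view.
        values = [z[d] for z in latents]
        mean = fmean(values)
        # Population variance of this coordinate across views.
        total += fmean([(v - mean) ** 2 for v in values])
    # Average over latent dimensions so the scale is independent of D.
    return total / dim
```

Under these assumptions the penalty is zero when every view shares the same latent and grows as the views drift apart, which is the behavior a consistency regularizer of this kind would reward.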
Jun et al. (Tue,) studied this question.