ABSTRACT Deep learning has demonstrated significant promise in 3D content generation; however, current methods are often not robust in complex scenes, generate low-resolution outputs, and achieve unsatisfactory mean opinion scores (MOS). To address these limitations, this paper proposes a diffusion-based approach that reframes novel view synthesis as an image restoration problem: multiview 3D image generation is reformulated as a conditional inpainting task to improve geometric consistency and visual fidelity. The proposed method supports robust 3D content generation at 2K resolution, preserves fine texture details up to 16K, and attains an MOS of 3.83 on 4K two-view 3D displays.
Gu et al. studied this question.