Multi-view 3D reconstruction is essential for smart cities, supporting applications such as urban planning and autonomous navigation. While traditional reconstruction pipelines and recent neural implicit methods such as NeRF achieve high visual fidelity, they often struggle with geometric accuracy and with sparse-view scenarios. To address these challenges, we present DiffLiGS, a novel multi-modal 3D reconstruction framework that integrates LiDAR point clouds and LiDAR-guided diffusion priors into the 3D Gaussian Splatting (3DGS) pipeline, producing high-fidelity, geometrically accurate models. Our method first densifies sparse LiDAR depths using a diffusion model and refines them through multi-view geometric constraints, yielding dense LiDAR depth maps that provide robust supervision for 3DGS optimization. Leveraging these dense depth maps, we guide a Stable Video Diffusion model to synthesize novel-view images, which are incorporated into training to enhance reconstruction completeness and visual realism. By jointly fusing rich appearance cues from multi-view images with precise LiDAR-derived geometry and diffusion priors, DiffLiGS achieves a unified, geometry-aware 3D scene representation. Extensive experiments demonstrate that our approach significantly improves both geometric accuracy and rendering quality over existing 3D reconstruction methods, enabling real-time, high-precision modeling of complex urban environments.
Gong et al. (Tue,) studied this question.
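
To make the idea of dense depth maps supervising 3DGS optimization more concrete, the following is a minimal sketch of one common way such supervision can be expressed: a photometric L1 term plus an L1 depth term restricted to pixels with a valid depth value. The abstract does not specify the loss actually used by DiffLiGS, so the function name `depth_supervised_loss`, the weight `depth_weight`, and the convention that a depth of 0 marks a missing measurement are all assumptions made for illustration.

```python
import numpy as np


def depth_supervised_loss(rendered_rgb, target_rgb,
                          rendered_depth, lidar_depth,
                          depth_weight=0.1):
    """Illustrative loss: photometric L1 plus masked L1 depth term.

    Hypothetical sketch, not the DiffLiGS loss. Assumes lidar_depth
    uses 0 to mark pixels without a depth measurement.
    """
    # Photometric term: mean absolute colour error over all pixels.
    photometric = np.abs(rendered_rgb - target_rgb).mean()

    # Only pixels with a valid (positive) depth contribute to the depth term.
    valid = lidar_depth > 0
    if valid.any():
        depth_term = np.abs(rendered_depth[valid] - lidar_depth[valid]).mean()
    else:
        depth_term = 0.0

    return float(photometric + depth_weight * depth_term)


# Tiny usage example on a 2x2 image with two valid depth pixels.
rendered_rgb = np.zeros((2, 2, 3))
target_rgb = np.ones((2, 2, 3))
rendered_depth = np.array([[1.0, 2.0], [3.0, 4.0]])
lidar_depth = np.array([[1.0, 0.0], [5.0, 0.0]])
loss = depth_supervised_loss(rendered_rgb, target_rgb,
                             rendered_depth, lidar_depth)
```

Masking the depth term means that sparse or partially densified depth maps can be used without penalizing pixels that have no measurement; with fully dense depth maps the mask simply covers every pixel.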