Improving the robustness of autonomous driving perception models depends on large-scale, diverse scenario data. However, real-world road data is expensive to collect, contains few extreme scenarios, and is difficult to label across multiple views. Generative AI scene synthesis has emerged as a key solution, with diffusion models gradually replacing GANs as the mainstream approach. This paper provides a systematic review of autonomous driving scene synthesis. It first traces the evolution of the technology and clarifies the core features and underlying logic of each generation. It then analyzes the representative solution DrivingDiffusion, the first video generation framework to achieve “3D layout controllability, multi-view coordination, and temporal coherence,” dissecting its architecture and core module design, which build on latent diffusion models (LDM). It further compares diffusion-based methods with traditional GAN-based approaches on key metrics such as scene fidelity and label consistency. Finally, it identifies the key open issues and challenges in the field and outlines future development directions, providing a reference for subsequent research on virtual data generation.
Yu Liu (Mon,) studied this question.