Corner cases, such as severe weather and abnormal lighting, present significant challenges in autonomous driving. The main obstacles are the difficulty of large-scale data collection and the cost of annotation. Using generative models to expand corner-case data from existing annotations offers a promising solution. Unlike monocular videos, multi-view videos introduce an additional "view" dimension, which raises the consistency requirements and makes precise annotation control more challenging. Existing methods decouple multi-view videos along the temporal and view-spatial axes and apply a separate attention mechanism to each, which causes motion discrepancies and limits consistency. In addition, current approaches employ an independent adapter or ControlNet to encode each type of 3D annotation, leading to high computational cost and suboptimal alignment between annotations and video latents. Both issues stem from neglecting the temporal-spatial relationship and from insufficient alignment between 3D annotations and video latents. To address these challenges, we propose DriveGen, which uses 4D position embeddings to encode the positional information of multi-view videos. DriveGen also introduces Dual-Scale Full Attention to ensure both global and local spatiotemporal consistency. Furthermore, our Shared Video-Condition Encoding (SVCE) mechanism converts 3D annotations into 2D masks and encodes both the video and annotation sequences with a 3D VAE, requiring only 0.37M learnable parameters to achieve pixel-level alignment while improving generation quality. Extensive experiments show that DriveGen achieves state-of-the-art performance and generates high-quality, controllable autonomous driving videos.
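To make the idea of a 4D position embedding concrete, the following is a minimal sketch and not DriveGen's actual implementation. It assumes that each video token is indexed by (view, frame, row, column), that each axis receives an equal share of the embedding channels, and that a standard sinusoidal encoding is used per axis. The function names `sinusoid` and `position_embedding_4d` and all sizes are hypothetical; the abstract does not specify any of these details.

```python
import math
from itertools import product


def sinusoid(pos: int, dim: int) -> list[float]:
    """Standard sinusoidal encoding of one integer position into `dim` values (dim even)."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out


def position_embedding_4d(views: int, frames: int, height: int, width: int, dim: int):
    """Map every (view, frame, row, col) index to a `dim`-sized embedding.

    Assumption (not from the paper): the channels are split evenly into four
    slices, one per axis, and the slices are concatenated.
    """
    if dim % 8 != 0:
        raise ValueError("dim must be divisible by 8 (4 axes, sin/cos pairs)")
    per_axis = dim // 4
    table = {}
    for v, t, h, w in product(range(views), range(frames), range(height), range(width)):
        table[(v, t, h, w)] = (
            sinusoid(v, per_axis)
            + sinusoid(t, per_axis)
            + sinusoid(h, per_axis)
            + sinusoid(w, per_axis)
        )
    return table


# Toy sizes chosen only for illustration: 2 cameras, 3 frames, 2x2 latent grid.
emb = position_embedding_4d(views=2, frames=3, height=2, width=2, dim=16)
```

Because every token carries its view and time index alongside its spatial location, a single attention operation over all tokens can in principle distinguish "same place, different camera" from "same place, different frame" without separate temporal and view-spatial attention passes.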
Kang et al. (Thu,) studied this question.