What does this research mean for the field?

DriveGen is a generative model that uses 4D position embeddings and a Shared Video-Condition Encoding mechanism to generate high-quality, spatiotemporally consistent multi-view videos for autonomous driving. It reports state-of-the-art results while requiring only 0.37 M learnable parameters to align annotations with video latents. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This study aims to address challenges in generating high-quality multi-view videos for autonomous driving, particularly under corner-case scenarios.

June 5, 2026

DriveGen: Shared Video-Condition Encoding for Autonomous Multi-View Video Generation

Key Points

Targets high-quality multi-view video generation for autonomous driving, particularly in corner-case scenarios.
Proposes DriveGen, which uses 4D position embeddings to encode the positional information of multi-view videos.
Introduces Dual-Scale Full Attention to ensure both global and local spatiotemporal consistency.
Introduces Shared Video-Condition Encoding (SVCE), which converts 3D annotations into 2D masks and encodes both video and annotation sequences with a 3D VAE.
Achieves pixel-level alignment between annotations and video latents with only 0.37 M learnable parameters.
Reports state-of-the-art generation quality for controlled autonomous driving videos in experiments.
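
To make the idea of a 4D position embedding more concrete, the sketch below assigns each latent token a (view, time, height, width) coordinate and embeds each axis separately before concatenating the results. The summary does not say how DriveGen actually builds its embeddings, so the sinusoidal form, the function names (`sinusoid`, `embed_4d`, `token_grid`), and the dimensions used here are all illustrative assumptions rather than the authors' design.

```python
import math

def sinusoid(pos, dim):
    """Sinusoidal embedding of one integer coordinate into `dim` values (dim must be even).
    Assumption: the source does not specify the embedding type."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out

def embed_4d(view, frame, row, col, dim_per_axis=4):
    """Concatenate per-axis embeddings for a (view, time, height, width) coordinate."""
    vec = []
    for coord in (view, frame, row, col):
        vec.extend(sinusoid(coord, dim_per_axis))
    return vec

def token_grid(n_views, n_frames, height, width):
    """Enumerate the 4D coordinate of every latent token in flattened order."""
    return [(v, t, y, x)
            for v in range(n_views)
            for t in range(n_frames)
            for y in range(height)
            for x in range(width)]

# Hypothetical sizes: 6 camera views, 2 latent frames, a 2 x 3 latent grid.
coords = token_grid(n_views=6, n_frames=2, height=2, width=3)
embeddings = [embed_4d(*c) for c in coords]
```

Because every token carries all four coordinates, one attention layer over the flattened sequence can in principle distinguish neighbours across views, frames, and pixels at once, which is the property that decoupled temporal and view-spatial attention lacks.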

Abstract

Corner cases, such as severe weather and abnormal lighting, present significant challenges in autonomous driving. The main obstacles are large-scale data collection and costly annotation. Using generative models to expand corner-case data from existing annotations offers a promising solution. Unlike monocular videos, multi-view videos introduce an additional "view" dimension, which raises the consistency requirements and makes precise control through annotations more difficult. Existing methods decouple multi-view videos along the temporal and view-spatial axes and apply separate attention mechanisms to each, which causes motion discrepancies and limits consistency. In addition, current approaches use an independent adapter or ControlNet to encode each type of 3D annotation, leading to high computational cost and suboptimal alignment between annotations and video latents. These issues stem from neglecting the temporal-spatial relationship and from insufficient alignment between 3D annotations and video latents. To address these challenges, we propose DriveGen, which uses 4D position embeddings to encode the positional information of multi-view videos. DriveGen also introduces Dual-Scale Full Attention to ensure both global and local spatiotemporal consistency. Furthermore, our Shared Video-Condition Encoding (SVCE) mechanism converts 3D annotations into 2D masks and encodes both video and annotation sequences with a 3D VAE, requiring only 0.37 M learnable parameters to achieve pixel-level alignment while improving generation quality. Extensive experiments show that DriveGen achieves state-of-the-art performance and can generate high-quality, controllable autonomous driving videos.
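
The abstract states that SVCE converts 3D annotations into 2D masks but gives no detail on how. As a rough illustration of what such a conversion can look like, the sketch below projects the corners of a 3D box through a pinhole camera and fills the pixel rectangle they cover. The box parametrization, the camera model, the coarse rectangle fill, and all names (`box_corners`, `project`, `box_mask`) are assumptions made for this example, not the paper's procedure.

```python
import math

def box_corners(center, size):
    """Eight corners of an axis-aligned 3D box in camera coordinates
    (x right, y down, z forward). Hypothetical parametrization."""
    cx, cy, cz = center
    dx, dy, dz = (s / 2.0 for s in size)
    return [(cx + sx * dx, cy + sy * dy, cz + sz * dz)
            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]

def project(point, focal, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)

def box_mask(center, size, focal, width, height):
    """Binary mask covering the 2D extent of the projected box.
    A coarse stand-in for rasterizing the exact projected polygon."""
    pts = [project(p, focal, width / 2.0, height / 2.0)
           for p in box_corners(center, size) if p[2] > 0]  # skip corners behind the camera
    mask = [[0] * width for _ in range(height)]
    if not pts:
        return mask
    u0 = max(0, math.floor(min(u for u, _ in pts)))
    u1 = min(width - 1, math.ceil(max(u for u, _ in pts)))
    v0 = max(0, math.floor(min(v for _, v in pts)))
    v1 = min(height - 1, math.ceil(max(v for _, v in pts)))
    for v in range(v0, v1 + 1):
        for u in range(u0, u1 + 1):
            mask[v][u] = 1
    return mask

# A 2 m cube centred 10 m in front of a hypothetical 16 x 12 pixel camera.
mask = box_mask(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 2.0),
                focal=40.0, width=16, height=12)
```

A mask produced this way lives on the same pixel grid as the video frames, so the same encoder can in principle process both. That shared grid is what makes the pixel-level alignment described in the abstract plausible without a separate adapter per annotation type.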
