September 10, 2025

Dynamic Alignment and Diffusion Models for Multi-Modal 3D Object Detection in Bird’s-Eye View

Key Points

DAD-Fusion improves 3D object detection performance, reaching 71.1% mAP and 73.4% NDS on the nuScenes test set.
This method effectively manages alignment errors and perceptual noise across multi-modal sensor inputs for better feature representation.
Dynamic alignment and a diffusion model contribute to enhanced detection capabilities, as validated by tests on the nuScenes and KITTI datasets.
The model generates optimized BEV features that can be shared end-to-end and jointly optimized with downstream tasks.

Abstract

3D object detection is a core task in environmental perception for autonomous driving. Current multi-modal methods, which fuse features from sensors such as LiDAR and cameras, have shown promise in improving detection performance. Nevertheless, these methods remain susceptible to calibration errors and noise interference in real-world scenarios. These issues lead to suboptimal alignment and fusion of multi-modal features, degrading the model's detection performance and generalization capability. To overcome these limitations, this paper proposes the DAD-Fusion method. A dynamic alignment module generates a learnable offset field that adaptively corrects the spatial misalignment between LiDAR and camera bird's-eye view (BEV) features. Concurrently, a diffusion model suppresses perceptual noise in multi-modal fusion through a progressive denoising mechanism, strengthening the feature representation. Extensive experiments show that DAD-Fusion reaches 71.1% mAP and 73.4% NDS on the nuScenes test set. To further validate its generalization capability, the model was also evaluated on the KITTI dataset, where it significantly outperforms the baseline method. In addition, the model generates optimized BEV features that can be shared end-to-end and jointly optimized with downstream tasks, contributing to the development of more efficient and robust perception and decision-making systems.
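To make the offset-field idea concrete, the following is a minimal NumPy sketch of resampling one BEV feature map at positions shifted by a per-cell offset field. It is an illustration under stated assumptions, not the paper's implementation: the function name `warp_bev`, the tensor layout, and the bilinear sampling with border clamping are all hypothetical choices, and the offset field here is set by hand, whereas in DAD-Fusion it is described as learned by the dynamic alignment module. The diffusion-based denoising step is not covered.

```python
import numpy as np

def warp_bev(feat, offset):
    """Bilinearly resample a BEV feature map at offset-shifted positions.

    feat:   (C, H, W) feature map, e.g. camera BEV features.
    offset: (2, H, W) per-cell shift in grid cells; offset[0] is dx, offset[1] is dy.
    Hypothetical helper for illustration only.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Sampling locations, clamped to the grid border.
    x = np.clip(xs + offset[0], 0, W - 1)
    y = np.clip(ys + offset[1], 0, H - 1)

    # Integer corners around each sampling location.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)

    # Bilinear interpolation weights.
    wx = x - x0
    wy = y - y0

    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy

# Toy example: camera features are the LiDAR features shifted one cell
# to the right, simulating a calibration error.
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
lidar = rng.normal(size=(C, H, W))
camera = np.roll(lidar, shift=1, axis=2)

# Hand-set offset field (assumption): sample one cell to the right everywhere.
offset = np.zeros((2, H, W))
offset[0] = 1.0

aligned = warp_bev(camera, offset)

# Mean absolute mismatch before and after alignment, ignoring the last
# column, where the clamped border cannot be recovered.
err_before = np.abs(camera[:, :, :-1] - lidar[:, :, :-1]).mean()
err_after = np.abs(aligned[:, :, :-1] - lidar[:, :, :-1]).mean()
print(err_before, err_after)
```

In a trainable setting the offset field would be predicted from the two feature maps and the sampling would need to be differentiable, for example with a framework's grid-sampling operator; the sketch above only shows the resampling arithmetic.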
