Autonomous driving requires a robust 3D perception system capable of accurate object detection, tracking, and segmentation. Although recent low-cost camera-based methods have shown promising results, they degrade under poor lighting or adverse weather, leading to considerable localization errors. In this paper, we present Frequency-aware Depth Association Radar-Camera (FDARC) Fusion, a novel approach that generates semantically rich and spatially accurate Bird’s-Eye-View (BEV) feature maps by integrating camera and radar data. First, the image features are enhanced using frequency-aware techniques. These features are then transformed into a BEV representation using depth information estimated from both sensor modalities together with radar measurements. This process, which we call Depth Association (DA), yields more precise BEV representations. Next, a Temporal and Deformable Cross-Fusion (TDCF) layer encodes the multi-modal feature maps into a unified spatio-temporal representation. Extensive experiments on the nuScenes dataset show that FDARC achieves state-of-the-art performance in 3D detection: with a ResNet-50 backbone, it attains 53.5% nuScenes Detection Score (NDS) and 44.7% mean Average Precision (mAP) on the nuScenes val set, markedly outperforming baseline models.
Wang et al. studied this question.
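For readers unfamiliar with the NDS metric reported above, the following minimal sketch shows how the nuScenes benchmark combines mAP with five mean true-positive error terms (translation, scale, orientation, velocity, attribute), each clipped to 1: NDS = (1/10) [5 mAP + Σ (1 − min(1, mTP))]. The function name and the example error values are hypothetical and chosen only for illustration; they are not results from this work.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """Combine mAP with the five mean true-positive (TP) errors into NDS.

    mean_ap   -- mean Average Precision as a fraction in [0, 1]
    tp_errors -- dict with the five mean TP errors
                 (mATE, mASE, mAOE, mAVE, mAAE)
    """
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five TP error terms")
    # Each error is clipped to 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    # mAP carries weight 5; each of the five TP scores carries weight 1.
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Hypothetical inputs, for illustration only (not values from the paper).
example_errors = {
    "mATE": 0.5,  # translation error
    "mASE": 0.3,  # scale error
    "mAOE": 0.5,  # orientation error
    "mAVE": 0.4,  # velocity error
    "mAAE": 0.2,  # attribute error
}
example_nds = nuscenes_detection_score(0.4, example_errors)
print(f"NDS = {example_nds:.3f}")
```

Because mAP carries half of the total weight, NDS can never exceed mAP by more than 0.5, and the remaining half rewards accurate box geometry, heading, velocity, and attribute estimates.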