This paper proposes SpecBEV, an enhanced multi-view 3D object detection framework for autonomous driving based on bird’s-eye-view (BEV) representations. Compared with LiDAR-based methods, multi-camera perception is more cost-effective and flexible. However, existing end-to-end BEV detectors are sensitive to illumination variations, occlusions, and cross-view inconsistencies during feature projection and fusion. These issues often introduce redundant background activations and geometric misalignment in the BEV space, leading to missed detections, false positives, and unstable localization. To address these issues, we introduce a frequency-prior spatial attention module (SA-Freq). It uses fixed discrete cosine transform (DCT) bases to model the multi-band responses of BEV features and to produce spatial attention weights that suppress redundant activations and enhance target-related regions. We further design a cross-view feature alignment module (CFA) that enforces consistency between single-view BEV features and the fused BEV representation, thereby reducing geometric inconsistency and improving localization stability. Experiments on the nuScenes validation set show that SpecBEV achieves an mAP of 0.3856 and an NDS of 0.4871. Compared with the BEVDet baseline, this is an absolute gain of 0.1028 in mAP (36.35% relative improvement) and an absolute gain of 0.1371 in NDS (39.17% relative improvement), which validates the effectiveness of the proposed method.
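To make the idea behind frequency-prior spatial attention concrete, the following is a minimal sketch and not the SA-Freq module itself. It assumes that the BEV feature is first averaged over channels, that each band response is the projection of this map onto one fixed orthonormal 2D DCT-II basis, that the selected bands are summed with equal (non-learned) weights, and that a sigmoid turns the result into spatial attention weights. The function names `dct_basis` and `freq_spatial_attention`, the choice of four low-frequency bands, and the toy 4x4 BEV grid are all hypothetical and chosen only for illustration.

```python
import math


def dct_basis(u, v, h, w):
    """Orthonormal 2D DCT-II basis function for frequency (u, v) on an h x w grid."""
    au = math.sqrt((1.0 if u == 0 else 2.0) / h)
    av = math.sqrt((1.0 if v == 0 else 2.0) / w)
    return [[au * av
             * math.cos(math.pi * (y + 0.5) * u / h)
             * math.cos(math.pi * (x + 0.5) * v / w)
             for x in range(w)] for y in range(h)]


def freq_spatial_attention(bev, freqs=((0, 0), (0, 1), (1, 0), (1, 1))):
    """Hypothetical DCT-based spatial attention over a BEV feature.

    bev:   nested list of shape [C][H][W].
    freqs: (u, v) DCT bands to keep (an assumption; equal band weights).
    Returns (attn, out): attn is [H][W] in (0, 1), out is the reweighted [C][H][W].
    """
    c, h, w = len(bev), len(bev[0]), len(bev[0][0])

    # Collapse channels so that attention is purely spatial.
    mean_map = [[sum(bev[k][y][x] for k in range(c)) / c for x in range(w)]
                for y in range(h)]

    # Sum the responses of the selected frequency bands.
    logits = [[0.0] * w for _ in range(h)]
    for u, v in freqs:
        basis = dct_basis(u, v, h, w)
        # Band coefficient: projection of the map onto the fixed basis.
        coeff = sum(mean_map[y][x] * basis[y][x]
                    for y in range(h) for x in range(w))
        for y in range(h):
            for x in range(w):
                logits[y][x] += coeff * basis[y][x]

    # Sigmoid maps band responses to attention weights in (0, 1).
    attn = [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in logits]

    # Reweight every channel with the shared spatial attention map.
    out = [[[bev[k][y][x] * attn[y][x] for x in range(w)] for y in range(h)]
           for k in range(c)]
    return attn, out


# Toy BEV feature: 2 channels, 4x4 grid, one "object" in the top-left 2x2 block.
bev = [[[4.0 if (y < 2 and x < 2) else 0.0 for x in range(4)] for y in range(4)]
       for _ in range(2)]
attn, out = freq_spatial_attention(bev)

# Keeping all 16 bands reconstructs the channel-mean map exactly before the sigmoid.
all_freqs = tuple((u, v) for u in range(4) for v in range(4))
attn_full, _ = freq_spatial_attention(bev, freqs=all_freqs)
```

In this toy case the low-frequency bands give the object region a higher weight than the empty background, which is the suppression behaviour the abstract describes. In a trained detector the band weights would more plausibly be learned than fixed to equal values; that choice is an assumption of this sketch.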
Lin et al. (Wed,) studied this question.