What question did this study set out to answer?

To improve the representation quality of automotive radar point clouds for better 3D object detection.

May 21, 2026 | Open Access

CRAVEN: A Camera–Radar Attention‐Based Voxel Encoding Network for 3D Object Detection

Key Points

Designed a multiscale aggregation module to capture local radar geometry.
Developed a learnable attentive voxel encoding module for enhanced feature representation.
Implemented an adaptive gated BEV fusion module for radar and camera feature integration.
The proposed framework improved 3D object detection accuracy.
Improvements over representative baselines were consistent on the View-of-Delft (VoD) dataset.
Spatially varying modality weights in the fusion step exploited cross-modal complementarity and suppressed unreliable cues.
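
The gated fusion idea in the key points can be illustrated with a minimal sketch. This is not the authors' AGBF module: the function name `gated_fuse`, the linear gate, and its parameters `w_radar`, `w_camera` and `bias` are hypothetical, and a real model would learn the gate from data. The sketch only shows how a per-cell gate in (0, 1) can blend radar and camera BEV features with spatially varying weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(radar, camera, w_radar, w_camera, bias):
    """Blend two BEV feature maps cell by cell (illustrative only).

    radar, camera: H x W x C nested lists of floats.
    w_radar, w_camera: length-C gate weights (hypothetical parameters).
    bias: scalar gate bias.
    Returns (fused, gates); gates is H x W, with values near 1 favouring
    radar and values near 0 favouring camera.
    """
    fused, gates = [], []
    for r_row, c_row in zip(radar, camera):
        f_row, g_row = [], []
        for r, c in zip(r_row, c_row):
            # The gate looks at both modalities in the same BEV cell.
            logit = bias
            logit += sum(w * x for w, x in zip(w_radar, r))
            logit += sum(w * y for w, y in zip(w_camera, c))
            g = sigmoid(logit)
            # Convex combination of the two feature vectors.
            f_row.append([g * x + (1.0 - g) * y for x, y in zip(r, c)])
            g_row.append(g)
        fused.append(f_row)
        gates.append(g_row)
    return fused, gates
```

With all gate parameters at zero, every gate equals 0.5 and the output is the plain average of the two maps; a trained gate would instead shift weight toward whichever modality is more reliable in each cell.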

Abstract

The development of intelligent vehicle perception systems has raised increasingly stringent requirements on the representation quality of 4D automotive millimetre‐wave radar point clouds (RPC). However, the extremely sparse and irregular nature of RPC leads to insufficient structural cues for reliable 3D object perception, especially when conventional voxel encoders rely on heuristic aggregation (e.g., max‐pooling), which limits feature expressiveness. In this paper, an attention‐based radar pillar representation and BEV fusion framework for 3D object detection is proposed. Firstly, a multiscale aggregation (MSA) module is designed to aggregate local radar points under multiple receptive‐field sizes, enabling robust local geometry modelling from sparse RPC. Secondly, a learnable attentive voxel encoding (LAVE) module is proposed to construct expressive voxel representations. In this module, a set of learnable latent vectors interacts with neighbourhood point features via cross‐attention to adaptively encode voxel‐level features, while self‐attention is further applied across voxels in BEV space to capture intervoxel contextual dependencies and enhance global structural reasoning. Lastly, an adaptive gated BEV fusion (AGBF) module is designed to fuse radar and camera BEV features with spatially varying modality weights, exploiting cross‐modal complementarity whilst suppressing unreliable cues. Experiments conducted on the View‐of‐Delft (VoD) dataset demonstrate the effectiveness of the proposed radar modelling and fusion strategy, yielding consistent improvements over representative baselines.
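
To make the contrast between heuristic max‐pooling and attention‐based voxel encoding concrete, here is a minimal sketch for a single voxel. It is an illustration under stated assumptions, not the paper's LAVE module: the names `attentive_voxel_encode` and `max_pool` are hypothetical, the latent vectors are fixed inputs rather than trained parameters, and keys and values are the raw point features with no learned projections.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attentive_voxel_encode(latents, points):
    """Cross-attention from latent queries to the points of one voxel.

    latents: K query vectors of dimension D (assumed learnable in training).
    points:  N point feature vectors of dimension D, with N >= 1.
    Returns K vectors, each a softmax-weighted average of the points.
    """
    dim = len(points[0])
    scale = 1.0 / math.sqrt(dim)  # scaled dot-product attention
    encoded = []
    for q in latents:
        weights = softmax([dot(q, p) * scale for p in points])
        encoded.append(
            [sum(w * p[d] for w, p in zip(weights, points)) for d in range(dim)]
        )
    return encoded


def max_pool(points):
    """Heuristic baseline: channel-wise maximum over the voxel's points."""
    return [max(p[d] for p in points) for d in range(len(points[0]))]
```

A zero latent attends uniformly and returns the mean of the points, whereas a latent aligned with one point concentrates its weight there. Max‐pooling, by comparison, keeps only the per‐channel maximum regardless of which points are informative.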

Cite This Study

Wang et al. (2026) studied this question.

synapsesocial.com/papers/6a0ea10ebe05d6e3efb5f645 https://doi.org/10.1049/rsn2.70166
