ABSTRACT The development of intelligent vehicle perception systems places increasingly stringent demands on the representation quality of 4D automotive millimetre‐wave radar point clouds (RPC). However, RPC are extremely sparse and irregular, so they provide insufficient structural cues for reliable 3D object perception, especially when conventional voxel encoders rely on heuristic aggregation (e.g., max‐pooling), which limits feature expressiveness. In this paper, an attention‐based radar pillar representation and BEV fusion framework for 3D object detection is proposed. Firstly, a multiscale aggregation (MSA) module is designed to aggregate local radar points under multiple receptive‐field sizes, enabling robust local geometry modelling from sparse RPC. Secondly, a learnable attentive voxel encoding (LAVE) module is proposed to construct expressive voxel representations. In this module, a set of learnable latent vectors interacts with neighbourhood point features via cross‐attention to adaptively encode voxel‐level features, and self‐attention is then applied across voxels in BEV space to capture intervoxel contextual dependencies and enhance global structural reasoning. Lastly, an adaptive gated BEV fusion (AGBF) module is designed to fuse radar and camera BEV features with spatially varying modality weights, exploiting cross‐modal complementarity whilst suppressing unreliable cues. Experiments conducted on the View‐of‐Delft (VoD) dataset demonstrate the effectiveness of the proposed radar modelling and fusion strategy, yielding consistent improvements over representative baselines.
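To make the idea of spatially varying modality weights concrete, the following is a minimal sketch of a gated BEV fusion step, not the AGBF module itself. The abstract does not specify how the gate is computed; the sketch assumes a single scalar gate per BEV cell, obtained as the sigmoid of a linear map over both modalities' features at that cell. The function name `gated_bev_fusion` and the parameters `w_radar`, `w_camera`, and `bias` are hypothetical and introduced only for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_bev_fusion(radar, camera, w_radar, w_camera, bias):
    """Fuse two BEV feature maps of shape [H][W][C] cell by cell.

    Assumption (not from the paper): the gate at each cell is
    sigmoid(bias + w_radar . r + w_camera . c), where r and c are the
    radar and camera feature vectors at that cell.
    """
    fused = []
    for radar_row, camera_row in zip(radar, camera):
        fused_row = []
        for r, c in zip(radar_row, camera_row):
            # Gate logit depends on the local features, so it varies per cell.
            logit = bias
            logit += sum(w * x for w, x in zip(w_radar, r))
            logit += sum(w * x for w, x in zip(w_camera, c))
            g = sigmoid(logit)
            # g weights the radar feature, (1 - g) weights the camera feature.
            fused_row.append([g * x + (1.0 - g) * y for x, y in zip(r, c)])
        fused.append(fused_row)
    return fused


# Toy 1 x 2 BEV grid with 2 channels per cell.
radar = [[[1.0, 2.0], [0.0, 4.0]]]
camera = [[[3.0, 0.0], [2.0, 2.0]]]

# Zero gate weights give g = 0.5 everywhere, i.e. a plain average.
average = gated_bev_fusion(radar, camera, [0.0, 0.0], [0.0, 0.0], 0.0)
```

With zero weights and zero bias the gate is 0.5 in every cell and the fusion reduces to averaging; non-zero weights let the gate lean towards radar in some cells and towards camera in others, which is the behaviour the abstract describes as suppressing unreliable cues. A practical implementation would typically predict the gate with a small convolutional layer over the concatenated BEV maps; that detail is likewise an assumption here.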
Wang et al. (Thu,) studied this question.