Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline: they first construct a dense BEV feature and then perform object detection in BEV space, which incurs complex view transformations and high computation costs. Sparse detectors, on the other hand, follow a query-based paradigm without explicit dense BEV feature construction, but they generally underperform dense ones. In this paper, we find that the key to closing this performance gap is the adaptability of the detector in both BEV and image space. To this end, we propose a fully sparse 3D object detector that outperforms its dense counterparts and runs faster. Our sparse detector contains three key designs: (1) scale-adaptive self attention to aggregate features with an adaptive receptive field in BEV space, (2) scale-adaptive cross attention to capture the unique temporal dynamics associated with different objects, and (3) adaptive sampling and mixing to perform interactions between queries and image features under the guidance of the queries. Together, these components enhance the adaptability of the detector in both BEV and image space. Furthermore, we explore two distinct temporal modeling approaches to leverage temporal features effectively: sampling-point-based multi-frame stacking (dubbed SparseBEV) and query-based recurrent temporal fusion (dubbed SparseBEV++). Experiments are conducted on the nuScenes and Waymo datasets. On the val split of nuScenes, both SparseBEV and SparseBEV++ surpass all previous methods. SparseBEV achieves 55.8 NDS at a speed of 23.5 FPS, and SparseBEV++ further achieves 57.1 NDS while maintaining a real-time inference speed of 24.6 FPS. On the Waymo dataset, our best-performing model, SparseBEV++, outperforms previous methods with a lead of 58.9 mAP and 55.2 mAPH.
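To make the idea of an adaptive receptive field in BEV space more concrete, the following is a minimal sketch of one way a scale-adaptive self attention could work. It is an illustrative assumption, not the formulation given in the abstract: each query's attention logits are penalized by the BEV distance to other queries, scaled by a per-query coefficient. The function name `scale_adaptive_attention` and the parameters `scores`, `centers`, and `tau` are hypothetical.

```python
import math


def scale_adaptive_attention(scores, centers, tau):
    """Distance-penalized softmax attention over object queries (sketch).

    scores[i][j]: raw attention logit between query i and query j.
    centers[i]:   (x, y) BEV center of query i.
    tau[i]:       non-negative scale for query i (assumed to be predicted
                  from the query in a real model); a larger tau means a
                  narrower receptive field in BEV space.
    Returns a row-normalized attention weight matrix as nested lists.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Subtract a distance penalty so far-away queries get lower logits.
        logits = [
            scores[i][j] - tau[i] * math.dist(centers[i], centers[j])
            for j in range(n)
        ]
        # Numerically stable softmax over the penalized logits.
        peak = max(logits)
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With `tau[i] = 0` the penalty vanishes and the row reduces to an ordinary softmax (a global receptive field); as `tau[i]` grows, query `i` attends mostly to nearby queries. Under this assumption, the adaptivity comes from letting each query choose its own `tau`.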
Yang Chen
Haisong Liu
Limin Wang
IEEE Transactions on Pattern Analysis and Machine Intelligence
Nanjing University
www.synapsesocial.com/papers/69d0ae68659487ece0fa4550 — DOI: https://doi.org/10.1109/tpami.2026.3679808