What does this research mean for the field?

A fully sparse framework that uses scale-adaptive attention and adaptive sampling for multi-view 3D object detection outperforms dense detectors in both accuracy and inference speed. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

The study aims to improve camera-based 3D object detection in BEV space with a fully sparse detection framework.

April 4, 2026

SparseBEV: A Fully Sparse Framework for Multi-View 3D Object Detection

Key Points

Proposed a fully sparse 3D object detector without dense BEV feature construction.
Implemented scale-adaptive self attention for feature aggregation in BEV space.
Developed sampling-point-based multi-frame stacking and query-based recurrent temporal fusion for temporal modeling.
On the nuScenes val split, SparseBEV achieves 55.8 NDS at 23.5 FPS.
SparseBEV++ improves this to 57.1 NDS at 24.6 FPS.
On the Waymo dataset, SparseBEV++ outperforms previous methods with 58.9 mAP and 55.2 mAPH.
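
The FPS figures above translate directly into a per-frame latency budget. The short sketch below does only that arithmetic; the helper name `fps_to_latency_ms` is hypothetical, and the source does not state the hardware the speeds were measured on, so the numbers are not comparable across setups.

```python
def fps_to_latency_ms(fps):
    """Convert throughput in frames per second to milliseconds per frame."""
    return 1000.0 / fps

# FPS values as reported in the key points above.
for name, fps in [("SparseBEV", 23.5), ("SparseBEV++", 24.6)]:
    print(f"{name}: {fps_to_latency_ms(fps):.1f} ms per frame")
```

Both models therefore spend a little over 40 ms per frame, with SparseBEV++ slightly faster despite its higher NDS.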

Abstract

Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation costs. On the other hand, sparse detectors follow a query-based paradigm without explicit dense BEV feature construction but generally underperform dense ones. In this paper, we find that the key to mitigating this performance gap is the adaptability of the detector in both BEV and image space. To this end, we propose a fully sparse 3D object detector that outperforms its dense counterparts and runs faster. Our sparse detector contains three key designs: (1) scale-adaptive self attention to aggregate features with an adaptive receptive field in BEV space, (2) scale-adaptive cross attention to capture the unique temporal dynamics associated with different objects, and (3) adaptive sampling and mixing to perform interactions between queries and image features under the guidance of queries. These key components enhance the adaptability of the detector in both BEV and image space. Furthermore, we explore two distinct temporal modeling approaches, sampling-point-based multi-frame stacking (dubbed SparseBEV) and query-based recurrent temporal fusion (dubbed SparseBEV++), to leverage temporal features effectively. Experiments are conducted on the nuScenes and Waymo datasets. On the val split of nuScenes, both SparseBEV and SparseBEV++ surpass all previous methods. SparseBEV achieves 55.8 NDS at 23.5 FPS, and SparseBEV++ further achieves 57.1 NDS while maintaining a real-time inference speed of 24.6 FPS. On the Waymo dataset, our best-performing model, SparseBEV++, outperforms previous methods, reaching 58.9 mAP and 55.2 mAPH.
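
The abstract names scale-adaptive self attention but does not give its formula. As a rough illustration only, the sketch below assumes one plausible form: the distance between two queries' BEV centers, scaled by a per-query coefficient `tau`, is subtracted from the usual dot-product logits, so a larger `tau` narrows the receptive field. The function name, the `tau` input, and the use of raw features as keys and values (no learned projections, single head) are all assumptions made for this example, not details taken from the source.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scale_adaptive_self_attention(feats, centers, tau):
    """Assumed form: logit[i][j] = q_i . q_j / sqrt(d) - tau[i] * dist(c_i, c_j).

    feats:   N query feature vectors, each of length d
    centers: N (x, y) query centers in BEV space
    tau:     N non-negative coefficients; 0 gives plain global attention
    Returns (outputs, attention_weights).
    """
    n, d = len(feats), len(feats[0])
    outputs, weights = [], []
    for i in range(n):
        logits = []
        for j in range(n):
            sim = sum(a * b for a, b in zip(feats[i], feats[j])) / math.sqrt(d)
            dist = math.dist(centers[i], centers[j])
            logits.append(sim - tau[i] * dist)  # distant queries are penalized
        w = softmax(logits)
        weights.append(w)
        outputs.append([sum(w[j] * feats[j][k] for j in range(n)) for k in range(d)])
    return outputs, weights

# Toy input: two nearby queries and one query 50 m away.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centers = [(0.0, 0.0), (1.0, 0.0), (50.0, 0.0)]
_, w_global = scale_adaptive_self_attention(feats, centers, tau=[0.0, 0.0, 0.0])
_, w_local = scale_adaptive_self_attention(feats, centers, tau=[1.0, 1.0, 1.0])
```

With `tau` set to zero the first query attends to the distant third query as strongly as its features suggest; with `tau` set to one that weight becomes negligible, which is the sense in which the receptive field adapts. In a trained detector the coefficient would presumably be predicted from each query rather than supplied by hand.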

Authors

Yang Chen

Haisong Liu

Limin Wang

Journals

IEEE Transactions on Pattern Analysis and Machine Intelligence

Institutions

Nanjing University
