What question did this study set out to answer?

The study set out to design Sparse4D, a novel sparse perception model that delivers better performance and higher efficiency in autonomous driving tasks than mainstream BEV-based algorithms.

May 1, 2026

Sparse4D: Sparse-based End-to-end Multi-Sensor Temporal Perception.

Key Points

Set out to design Sparse4D, a novel perception model with enhanced efficiency for autonomous driving tasks.
Developed the Sparse4D framework for end-to-end sparse perception.
Implemented deformable aggregation for spatial modelling and recurrent instance feature propagation for temporal modelling.
Conducted extensive experiments on the nuScenes benchmark, covering 3D detection and multi-object tracking.
Achieved state-of-the-art performance in multi-camera 3D detection and tracking tasks.
Outperformed existing algorithms in training and inference efficiency.
Extended Sparse4D to a multi-modal model with improved generalization.
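
As an illustration of the recurrent instance feature propagation mentioned above, here is a minimal sketch of one plausible reading: the most confident instances from the previous frame are carried forward, and their anchors are moved into the current ego frame. The function name `propagate_instances`, the top-k selection rule, and the 4x4 ego-motion matrix are assumptions made for this example, not details taken from the study.

```python
import numpy as np

def propagate_instances(features, anchors_xyz, scores, ego_prev_to_curr, k):
    """Carry the k most confident instances from the previous frame forward.

    features:         (N, C) implicit instance features
    anchors_xyz:      (N, 3) explicit anchor centres in the previous ego frame
    scores:           (N,)   confidence per instance
    ego_prev_to_curr: (4, 4) homogeneous transform (assumed to be available)
    """
    # Keep only the k highest-scoring instances (assumed selection rule).
    keep = np.argsort(-scores)[:k]
    # Move the explicit anchors into the current ego frame.
    homo = np.concatenate([anchors_xyz[keep], np.ones((len(keep), 1))], axis=1)
    moved = (ego_prev_to_curr @ homo.T).T[:, :3]
    # The implicit features are reused as-is; only the anchors are re-projected.
    return features[keep], moved
```

Because only k instances are passed from frame to frame, the cost of the temporal step in this sketch does not grow with the length of the history, which is consistent with the efficiency claim in the key points.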

Abstract

In the field of autonomous driving, the structural design of perception models is of paramount importance. Unlike mainstream BEV-based algorithms, we propose a novel end-to-end sparse perception framework, Sparse4D, to achieve better performance and higher efficiency. Starting from the sparsity of perception results, we define an instance that decouples implicit features and explicit anchors, and we use the instance as the core of feature fusion to accomplish perception tasks. For spatial modelling, we develop a novel operator called deformable aggregation, which transfers information from the dense image features to the sparse instances. For temporal modelling, we design a recurrent instance feature propagation structure, which not only realizes long-term feature fusion but also keeps the temporal module computationally efficient. Lastly, we explore the performance of Sparse4D in multi-object tracking tasks and propose a minimalist joint detection and tracking model. We conduct extensive experimental validation of Sparse4D on the nuScenes benchmark. Sparse4D achieves state-of-the-art performance in multi-camera 3D detection and tracking tasks, and it also outperforms other algorithms in terms of training and inference efficiency. Furthermore, we extend Sparse4D to a multi-modal model, achieving excellent performance and better generalization.
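
To make the idea of moving information from a dense image feature to a sparse instance concrete, the following is a heavily simplified sketch: one feature map, one instance, a handful of 2D keypoints, and one scalar weight per keypoint. The names `bilinear_sample` and `aggregate_instance` are hypothetical, and the multi-view, multi-scale handling a full operator would need is deliberately left out, so this should be read as an illustration of the sample-then-weight pattern rather than the operator described in the abstract.

```python
import numpy as np

def bilinear_sample(feature_map, xy):
    """Sample a (C, H, W) feature map at continuous pixel coords xy of shape (K, 2)."""
    _, height, width = feature_map.shape
    x = np.clip(xy[:, 0], 0, width - 1)
    y = np.clip(xy[:, 1], 0, height - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    wx = x - x0
    wy = y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T  # (K, C)

def aggregate_instance(feature_map, keypoints_xy, weights):
    """Fuse the features sampled at K keypoints into one (C,) instance feature."""
    sampled = bilinear_sample(feature_map, keypoints_xy)  # (K, C)
    weights = weights / weights.sum()                     # normalise the weights
    return weights @ sampled                              # weighted sum over keypoints
```

Only K locations of the dense map are ever read per instance, so the work in this sketch scales with the number of instances and keypoints rather than with the image resolution.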
