In autonomous driving, the structural design of perception models is of paramount importance. Departing from mainstream BEV-based algorithms, we propose Sparse4D, a novel end-to-end sparse perception framework that achieves better performance and higher efficiency. Starting from the sparsity of perception results, we define an instance that decouples implicit features from explicit anchors, and we use the instance as the core unit of feature fusion to accomplish perception tasks. For spatial modelling, we develop a novel operator called deformable aggregation, which transfers information from dense image features to sparse instances. For temporal modelling, we design a recurrent instance feature propagation structure, which not only realizes long-term feature fusion but also keeps the temporal module computationally efficient. We also explore the performance of Sparse4D in multi-object tracking and propose a minimalist joint detection and tracking model. We conduct extensive experiments on the nuScenes benchmark, where Sparse4D achieves state-of-the-art performance in multi-camera 3D detection and tracking and outperforms other algorithms in training and inference efficiency. Furthermore, we extend Sparse4D to a multi-modal model, achieving excellent performance and better generalization.
Lin et al. studied this question.