For autonomous vehicles (AVs), effective end-to-end (E2E) perception and future trajectory prediction are critical to planning safe automatic maneuvers. In current AV systems, perception and prediction are two separate modules. The prediction module receives only a restricted amount of information from the perception module, and perception errors propagate into the prediction module, degrading the accuracy of its results. In this paper, we present a novel framework termed BEV-TP, a visual context-guided, center-based transformer network for joint 3D perception and trajectory prediction. BEV-TP exploits visual information from consecutive multi-view images and context information from HD semantic maps to better predict object centers, whose locations are then used to query visual features and context features via the attention mechanism. The resulting agent queries and map queries facilitate learning in the transformer module, which performs further feature aggregation. Finally, multiple regression heads perform 3D bounding box detection and future velocity prediction. This center-based approach yields a differentiable, simple, and efficient E2E trajectory prediction framework. Extensive experiments on the nuScenes dataset demonstrate the effectiveness of BEV-TP over traditional sequential pipelines.
Lang et al. (Mon,) studied this question.
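The center-based querying described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the grid sizes, the nearest-cell lookup, the summation of agent and map queries, and the linear velocity head `W_vel` are all assumptions made for illustration, and every function name here is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gather_center_features(bev, centers):
    """Look up one feature vector per predicted center.

    bev:     (H, W, C) bird's-eye-view feature grid.
    centers: (N, 2) predicted centers as (row, col) in grid units.
    Assumption: nearest-cell lookup with clipping to the grid; a real
    model would more likely use a differentiable (bilinear) sampler.
    """
    h, w, _ = bev.shape
    rows = np.clip(np.rint(centers[:, 0]).astype(int), 0, h - 1)
    cols = np.clip(np.rint(centers[:, 1]).astype(int), 0, w - 1)
    return bev[rows, cols]  # (N, C)

def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention without learned projections.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16                        # assumed toy sizes
visual_bev = rng.normal(size=(H, W, C))   # stand-in for image-derived features
map_bev = rng.normal(size=(H, W, C))      # stand-in for HD-map features
centers = np.array([[1.2, 3.7], [6.0, 6.4], [9.5, -1.0]])  # last one is off-grid

# Centers index both feature grids to form agent and map queries.
agent_queries = gather_center_features(visual_bev, centers)
map_queries = gather_center_features(map_bev, centers)
queries = agent_queries + map_queries     # assumption: simple sum fusion

# Queries attend over all BEV cells for further feature aggregation.
context = visual_bev.reshape(-1, C)
aggregated, weights = cross_attention(queries, context, context)

# Hypothetical linear regression head producing a 2D velocity per agent.
W_vel = rng.normal(size=(C, 2)) * 0.1
velocities = aggregated @ W_vel
```

Because each step is an array lookup or a matrix product, the path from centers to velocities stays simple; swapping the nearest-cell lookup for bilinear sampling would also make it differentiable with respect to the center coordinates, which is the property the abstract emphasizes.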