The ability to predict the future states of nearby traffic agents is critical for autonomous vehicles. A recent paradigm is to sense and predict the future occupancy of surrounding targets from the Bird's Eye View (BEV) perspective, using information captured by multiple cameras mounted on the vehicle. However, modeling the underlying spatiotemporal interactions between traffic agents remains challenging. This paper proposes a novel spatiotemporal BEV pyramid network that employs a Swin Transformer to extract BEV features transformed from images and to predict across multiple scales. The network is designed to preserve spatial features at low resolution and capture semantic features embedded at high resolution. In addition, a Feature Alignment Module (FAM) is introduced to aggregate information at multiple scales and reduce mispredictions caused by feature misalignment. In validation on the nuScenes dataset, the proposed method achieves higher accuracy than the previous approaches it is compared against and improves occupancy prediction for various targets in the BEV.
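To make the idea of aggregating BEV information across scales concrete, the following is a minimal toy sketch and not the paper's FAM or its Swin Transformer backbone. It assumes a single-channel BEV grid stored as nested lists, nearest-neighbour upsampling, and plain averaging as the fusion rule; the names `upsample_nearest` and `fuse_scales` are hypothetical and chosen only for illustration. A learned alignment module would replace the fixed upsampling step, since blindly repeating coarse cells is one way features at different scales can end up misaligned.

```python
from typing import List

Grid = List[List[float]]  # one BEV feature channel, rows x cols (assumed layout)


def upsample_nearest(grid: Grid, factor: int) -> Grid:
    """Nearest-neighbour upsampling: repeat each cell `factor` times per axis."""
    out: Grid = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def fuse_scales(fine: Grid, coarse: Grid) -> Grid:
    """Bring `coarse` to the resolution of `fine`, then average cell by cell.

    Averaging is a stand-in for a learned fusion step (an assumption here).
    """
    factor = len(fine) // len(coarse)
    if len(coarse) * factor != len(fine) or len(coarse[0]) * factor != len(fine[0]):
        raise ValueError("fine grid must be an integer multiple of the coarse grid")
    up = upsample_nearest(coarse, factor)
    return [[(f + c) / 2.0 for f, c in zip(f_row, c_row)]
            for f_row, c_row in zip(fine, up)]


# A 4x4 fine-scale grid and a 2x2 coarse-scale grid covering the same BEV area.
fine = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0]]
coarse = [[1.0, 0.0],
          [0.0, 1.0]]

fused = fuse_scales(fine, coarse)
for row in fused:
    print(row)
```

In this toy case the fused grid keeps the sharp diagonal from the fine scale while the coarse scale adds weaker support to neighbouring cells, which is the kind of complementary information a multi-scale pyramid is meant to combine.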
Wu et al. studied this question.