Abstract
Bird’s-Eye-View (BEV) semantic segmentation is critical for environmental perception in autonomous driving. Surround-view fisheye camera systems are increasingly adopted to enlarge the perception range and eliminate blind spots. However, severe geometric distortion and frequent ego-motion make it difficult to align features accurately across multiple views and timestamps. Such misalignment often leads to semantic inconsistency and a notable drop in BEV segmentation accuracy. Moreover, most existing methods overlook these alignment errors and apply semantic supervision only at the final output, which yields suboptimal intermediate BEV representations. To address these challenges, we propose STSyn-BEV, a Spatio-Temporal Synchronized BEV segmentation framework for surround-view fisheye cameras. It comprises three key components: a Pose-Sync (pose-synchronized) encoder, a semantic consistency supervision module, and a stage-wise supervision decoder with heterogeneous pathways. First, the Pose-Sync encoder explicitly transforms multi-view fisheye features from previous poses and timestamps into a unified BEV space via geometric transformation, substantially improving geometric consistency and temporal alignment. Second, the semantic consistency supervision module applies region-level contrastive learning to the aggregated BEV features, enhancing semantic discrimination, particularly for long-tailed categories. Third, the stage-wise supervision decoder employs heterogeneous pathways (attention-based for global semantic reasoning and convolution-based for fine-grained structural refinement) guided by stage-wise supervision, which improves BEV feature decoding without additional inference cost. Extensive experiments on the FB-SSEM dataset demonstrate that STSyn-BEV surpasses state-of-the-art fisheye image-based BEV segmentation methods, notably achieving a 6.25% mIoU improvement over the strongest fisheye-specific baseline.
Liu et al. (Mon,) studied this question.
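To make the pose-synchronization idea in the abstract concrete, the following is a minimal sketch of ego-motion compensation for BEV coordinates. It is not the Pose-Sync encoder itself: it only illustrates the rigid step of re-expressing ground-plane points from a previous ego frame in the current ego frame. It assumes planar poses given as `(x, y, yaw)` in a shared world frame, with x pointing forward and y to the left; the function names `se2_matrix` and `prev_to_curr` are hypothetical and chosen for illustration.

```python
import numpy as np


def se2_matrix(x, y, yaw):
    """Return the 3x3 homogeneous matrix mapping ego-frame points to the world frame.

    Assumption: the ego pose is planar (x, y in metres, yaw in radians).
    """
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0.0, 0.0, 1.0]])


def prev_to_curr(points_prev, pose_prev, pose_curr):
    """Re-express (N, 2) points from the previous ego frame in the current ego frame.

    The composite transform is inverse(world_from_curr) @ world_from_prev.
    """
    points_prev = np.asarray(points_prev, dtype=float)
    transform = np.linalg.inv(se2_matrix(*pose_curr)) @ se2_matrix(*pose_prev)
    ones = np.ones((points_prev.shape[0], 1))
    homogeneous = np.hstack([points_prev, ones])  # (N, 3) row vectors
    return (homogeneous @ transform.T)[:, :2]
```

For example, if the vehicle drives 1 m straight ahead between two frames, `prev_to_curr(np.array([[2.0, 0.0]]), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))` maps a point that was 2 m ahead to a point 1 m ahead. In a feature-level pipeline the same transform would typically be applied to the coordinates of a BEV sampling grid rather than to individual points; how STSyn-BEV does this for fisheye features is not specified in the abstract.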