What question did this study set out to answer?

The aim is to improve BEV semantic segmentation by addressing spatio-temporal alignment errors in surround-view fisheye cameras.

March 12, 2026 | Open Access

STSyn-BEV: BEV segmentation from surround-view fisheye cameras via spatio-temporal synchronization

Key Points

Aims to improve BEV semantic segmentation by addressing spatio-temporal alignment errors in surround-view fisheye cameras.
Proposes the STSyn-BEV framework, comprising a Pose-Sync encoder, a semantic consistency supervision module, and a stage-wise supervised decoder with heterogeneous pathways.
Applies geometric transformation to align multi-view fisheye features from previous poses and timestamps in a unified BEV space.
Uses region-level contrastive learning on aggregated BEV features to improve semantic discrimination, particularly for long-tailed categories.
Combines attention-based decoding for global semantic reasoning with convolution-based decoding for fine-grained structural refinement.
Achieves a 6.25% improvement in mean Intersection over Union (mIoU) over the strongest fisheye-specific baseline.
Improves geometric consistency and temporal alignment through the Pose-Sync encoder.
Outperforms existing fisheye image-based BEV segmentation methods in extensive experiments on the FB-SSEM dataset.
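
For readers less familiar with the headline metric, mean Intersection over Union averages the per-class ratio of overlap to union between predicted and ground-truth label maps. The following is a minimal NumPy sketch of that general definition, not the paper's evaluation code. The function name `mean_iou` and the toy label maps are illustrative, and skipping classes that appear in neither map is an assumption, since the summary does not say how the FB-SSEM evaluation treats absent classes.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # assumption: ignore classes absent from both maps
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x3 BEV label maps with three classes (hypothetical data).
pred = np.array([[0, 0, 1], [1, 2, 2]])
target = np.array([[0, 1, 1], [1, 2, 2]])
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/2, 2/3, 1
```

Here the three per-class scores average to 13/18, or roughly 0.72.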

Abstract

Bird’s-Eye-View (BEV) semantic segmentation is critical for environmental perception in autonomous driving. Surround-view fisheye camera systems are increasingly adopted to enlarge the perception range and eliminate blind spots. However, severe geometric distortions and frequent ego-motion make accurate spatio-temporal feature alignment across multiple views and timestamps challenging. Such misalignment often leads to semantic inconsistency and notable drops in BEV segmentation accuracy. Moreover, most existing methods overlook these alignment errors and apply semantic supervision only at the final output, resulting in suboptimal intermediate BEV representations. To address these challenges, we propose STSyn-BEV, a Spatio-Temporal Synchronized BEV segmentation framework for surround-view fisheye cameras. It comprises three key components: a Pose-Sync (pose-synchronized) encoder, a semantic consistency supervision module, and a stage-wise supervision decoder with heterogeneous pathways. First, the Pose-Sync encoder explicitly transforms multi-view fisheye features from previous poses and timestamps into a unified BEV space via geometric transformation, substantially improving geometric consistency and temporal alignment. Second, the semantic consistency supervision module applies region-level contrastive learning to aggregated BEV features, enhancing semantic discrimination, particularly for long-tailed categories. Third, the deeply supervised decoder employs heterogeneous pathways (attention-based for global semantic reasoning and convolution-based for fine-grained structural refinement) guided by stage-wise supervision, enabling improved BEV feature decoding without additional inference cost. Extensive experiments on the FB-SSEM dataset demonstrate that STSyn-BEV surpasses state-of-the-art fisheye image-based BEV segmentation methods, notably achieving a 6.25% mIoU improvement over the strongest fisheye-specific baseline.
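
The abstract does not spell out the Pose-Sync encoder's transformation, so the following is only a generic illustration of the underlying idea: compensating for ego-motion by re-expressing ground-plane positions from a previous ego frame in the current one. It is a minimal sketch assuming planar (x, y, yaw) vehicle poses in a shared world frame. The helper names `se2` and `prev_to_curr` and the example poses are hypothetical and do not come from the paper.

```python
import numpy as np

def se2(x, y, yaw):
    """Homogeneous 3x3 matrix for a planar pose (translation plus rotation)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0.0, 0.0, 1.0]])

def prev_to_curr(points_prev, pose_prev, pose_curr):
    """Map Nx2 ground-plane points from the previous ego frame to the current one.

    Assumption: both poses are (x, y, yaw) of the vehicle in the same world frame.
    """
    t_world_prev = se2(*pose_prev)
    t_world_curr = se2(*pose_curr)
    # Previous frame -> world -> current frame.
    t_curr_prev = np.linalg.inv(t_world_curr) @ t_world_prev
    pts = np.asarray(points_prev, dtype=float)
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    return (t_curr_prev @ homo.T).T[:, :2]

# The vehicle drove 2 m forward with no rotation (hypothetical poses).
pose_prev = (0.0, 0.0, 0.0)
pose_curr = (2.0, 0.0, 0.0)
# A point 5 m ahead in the previous frame is 3 m ahead in the current frame.
aligned = prev_to_curr([[5.0, 0.0]], pose_prev, pose_curr)
```

In a full pipeline, a mapping like this would typically drive resampling of a previous BEV feature grid so that it overlays the current one; how STSyn-BEV does this for fisheye features is described in the full paper rather than in this summary.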
