Semantic segmentation for autonomous driving must balance high-fidelity perception against real-time latency. Transformers achieve state-of-the-art results, but their quadratic complexity becomes a bottleneck at high resolution. State Space Models (SSMs) such as Mamba offer linear complexity but often lose local detail and rely on inefficient scanning strategies. We introduce AutoMamba, a tailored Hybrid-SSM architecture. Its Hybrid-SSM block incorporates depthwise convolutions to inject local spatial priors and uses a Stage-Adaptive Mixed-Scanning strategy. This strategy prioritizes horizontal context in early stages to capture road layouts and activates vertical scanning only in deep layers to preserve anisotropic structures such as poles. Furthermore, we show that, unlike Transformers, Mamba architectures require Auxiliary Supervision and Online Hard Example Mining (OHEM) to address “long-tail forgetting.” Experiments on Cityscapes and BDD100K under a training-from-scratch setting demonstrate AutoMamba’s superiority. Notably, AutoMamba-B0 achieves 67.79% mIoU on Cityscapes with 31.3% fewer FLOPs than SegFormer-B0. Moreover, whereas the larger SegFormer-B2 fails with Out-Of-Memory errors at 2048×2048 resolution, AutoMamba-B2 scales efficiently, validating the linear-complexity advantage of the architecture for next-generation perception systems.
Sun et al. (Fri,) studied this question.
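To make two of the ideas named in the abstract more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It illustrates (a) a stage-dependent choice of scan direction, where early stages scan only horizontally and deeper stages add a vertical scan, and (b) an OHEM-style loss that averages only the hardest pixels. The function names, the `deep_from` threshold, and the top-k `keep_ratio` rule are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence


def scan_directions(stage: int, num_stages: int = 4, deep_from: int = 2) -> List[str]:
    """Pick scan directions for a 0-indexed stage.

    Assumed rule (hypothetical): stages before `deep_from` scan horizontally
    only; stages from `deep_from` onward add a vertical scan.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage out of range")
    if stage < deep_from:
        return ["horizontal"]
    return ["horizontal", "vertical"]


def flatten(grid: Sequence[Sequence[float]], direction: str) -> List[float]:
    """Turn a 2D feature grid into the 1D sequence an SSM would consume."""
    if direction == "horizontal":
        # Row-major order: walk each row left to right.
        return [value for row in grid for value in row]
    if direction == "vertical":
        # Column-major order: walk each column top to bottom.
        width = len(grid[0])
        return [row[col] for col in range(width) for row in grid]
    raise ValueError(f"unknown direction: {direction}")


def ohem_mean_loss(pixel_losses: Sequence[float], keep_ratio: float = 0.25,
                   min_kept: int = 1) -> float:
    """Average only the largest per-pixel losses (a simple top-k OHEM variant).

    Assumption: hard examples are chosen by a fixed ratio; other OHEM
    variants use a loss or confidence threshold instead.
    """
    if not pixel_losses:
        raise ValueError("pixel_losses must be non-empty")
    k = max(min_kept, int(len(pixel_losses) * keep_ratio))
    k = min(k, len(pixel_losses))
    hardest = sorted(pixel_losses, reverse=True)[:k]
    return sum(hardest) / k


if __name__ == "__main__":
    grid = [[1, 2, 3],
            [4, 5, 6]]
    for stage in range(4):
        for direction in scan_directions(stage):
            print(stage, direction, flatten(grid, direction))
    print(ohem_mean_loss([1.0, 3.0, 2.0, 0.0], keep_ratio=0.5))
```

In this sketch, a thin vertical object such as a pole occupies adjacent positions in the column-major sequence but is scattered across the row-major one, which is one plausible reading of why vertical scanning would help preserve such structures.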