Semantic segmentation plays a pivotal role in autonomous driving by enabling pixel-level understanding of road scenes. Although transformer-based models such as SegFormer perform strongly on large datasets, their generalization to smaller, geographically diverse datasets remains underexplored. In this work, we analyze the scalability and transferability of three SegFormer variants (B3, B4, B5), using CamVid as the base dataset. We perform cross-dataset transfer learning to KITTI and IDD, evaluate class-level performance, and explore explainability through confidence heatmaps. Our findings show that SegFormer-B5 achieves the highest accuracy on CamVid (82.4% mIoU), that transfer learning from CamVid improves mIoU on KITTI by 2.57%, and that it improves class-specific predictions on IDD by over 70%. These results highlight the practical potential of SegFormer for real-world segmentation systems and the interpretability benefits of confidence-based visual analysis.
Hatkar et al. (Sun,) studied this question.
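
The two quantities the abstract relies on, mIoU and per-pixel confidence, can be illustrated with a minimal sketch. It rests on assumptions not stated in the source: confidence is taken to be the maximum softmax probability per pixel (the paper may compute its heatmaps differently), mIoU is computed for a single image rather than accumulated over a dataset as benchmarks usually do, and the function names `mean_iou` and `confidence_map` are hypothetical. Plain Python lists stand in for tensors.

```python
import math


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union for one image.

    pred and target are 2D lists of integer class ids with the same shape.
    Classes absent from both pred and target are skipped (an assumption;
    some evaluation protocols count them differently).
    """
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                inter += int(p == c and t == c)
                union += int(p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def confidence_map(logits):
    """Per-pixel confidence, assumed here to be the max softmax probability.

    logits[y][x] is a list of raw class scores for pixel (y, x).
    """
    conf = []
    for row in logits:
        conf_row = []
        for scores in row:
            m = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - m) for s in scores]
            conf_row.append(max(exps) / sum(exps))
        conf.append(conf_row)
    return conf


if __name__ == "__main__":
    pred = [[0, 1], [1, 1]]
    target = [[0, 1], [0, 1]]
    print(mean_iou(pred, target, num_classes=2))
    print(confidence_map([[[2.0, 0.0], [0.0, 0.0]]]))
```

Rendering the values returned by `confidence_map` as a color image gives a heatmap in which low values mark pixels where the model is uncertain, which is the kind of visual analysis the abstract refers to.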