Image semantic segmentation is essential in autonomous driving, medical imaging, and remote sensing. Convolutional neural networks (CNNs) excel at local feature extraction and spatial structure modeling, but their limited receptive fields restrict their ability to capture long-range dependencies and maintain global semantic consistency. Transformers provide strong global modeling through self-attention, but they often lack local inductive bias and generalize less well on small datasets. To address these limitations, this paper proposes a Multi-Scale Context-aware Network (MSC-Net) for image semantic segmentation. Within an encoder–decoder framework, MSC-Net combines a convolutional backbone with a Multi-Scale Self-Attention module to integrate the complementary strengths of CNNs and attention mechanisms. The backbone extracts local texture and structural information and can be instantiated with architectures such as MobileNet, Xception, DRN, or ResNet, while the attention module captures long-range dependencies and multi-scale contextual information. This design improves cross-layer feature collaboration, multi-scale feature fusion, and boundary quality while maintaining computational efficiency. Experimental results show that MSC-Net achieves 38.8% mIoU and 98.4% ACC under comparable computational settings. Compared with SegFormer and DeepLabV3+, the model improves mIoU by approximately 3.0 and 3.3 percentage points, respectively, while reducing FLOPs and parameter count.
Yang et al. studied this question.
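For readers less familiar with the two metrics reported above, the following is a minimal sketch of how mIoU and ACC are commonly computed from a confusion matrix. It assumes that ACC denotes overall pixel accuracy and that mIoU is averaged over the classes that appear in the prediction or the ground truth; the paper's exact evaluation protocol may differ. The function names and the toy labels are illustrative and are not taken from the MSC-Net implementation.

```python
def confusion_matrix(pred, target, num_classes):
    """Count pixels by (true class, predicted class) over flat label lists."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of all pixels whose predicted class equals the true class."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def mean_iou(cm):
    """Mean over classes of TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class otherwise
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Hypothetical toy labels: class 0 dominates, class 1 is rare.
target = [0] * 8 + [1] * 2
pred = [0] * 8 + [0, 1]

cm = confusion_matrix(pred, target, num_classes=2)
acc = pixel_accuracy(cm)
miou = mean_iou(cm)
```

In this toy case the pixel accuracy is higher than the mIoU, because errors on the rare class weigh little in the overall pixel count but lower that class's IoU substantially. This is a general property of the two metrics under class imbalance and is offered only as background for reading the reported figures.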