Semantic segmentation of remote sensing images must cope with pronounced scale variation and complex class-wise spatial distributions, which often cause semantic discontinuity within large regions and loss of fine detail for small objects. To address these issues, this paper proposes RTAS-Net (ResNet–Transformer–ASPP Segmentation Network), a U-Net–based remote sensing semantic segmentation network that enhances feature representation through a coordinated design of multi-scale context aggregation, fine-scale reinforcement, and local-to-global modeling. At the high-semantic level, the network uses ASPP to aggregate multi-scale contextual information and enlarge the effective receptive field, and it integrates the window-based self-attention mechanism of Swin Transformer to model cross-region dependencies, improving semantic consistency over large areas. In the high-resolution skip connections, a lightweight mini-ASPP reinforces and pre-fuses fine-scale neighborhood information, and MobileViT strengthens local texture and fine-grained structural representations, improving the recognition and boundary delineation of small objects. Rather than simply stacking modules, RTAS-Net models global semantics and local details jointly through coordinated cross-level pathways. Experiments on the ISPRS Potsdam and Vaihingen datasets and on the LoveDA dataset show consistent improvements in mIoU, mF1, and OA. An accompanying analysis of parameter count and inference efficiency supports the method's effectiveness and practical applicability.
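For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how mIoU, mF1, and OA are conventionally computed from a per-class confusion matrix. It uses the standard textbook definitions only; the function name `segmentation_metrics`, the matrix layout, and the choice to skip classes that never occur are assumptions made for illustration, not the paper's evaluation code, and any dataset-specific handling (such as ignored classes) is not modeled here.

```python
from typing import List, Tuple


def segmentation_metrics(conf: List[List[int]]) -> Tuple[float, float, float]:
    """Return (mIoU, mF1, OA) from a square confusion matrix.

    conf[i][j] counts pixels whose true class is i and predicted class is j.
    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    if total == 0:
        raise ValueError("confusion matrix contains no pixels")

    ious, f1s = [], []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                            # true class c, predicted otherwise
        fp = sum(conf[r][c] for r in range(n)) - tp       # predicted c, true class differs
        if tp + fp + fn == 0:
            # Assumption: a class absent from both labels and predictions is skipped.
            continue
        ious.append(tp / (tp + fp + fn))                  # intersection over union
        f1s.append(2 * tp / (2 * tp + fp + fn))           # harmonic mean of precision and recall

    miou = sum(ious) / len(ious)                          # mean over classes that occur
    mf1 = sum(f1s) / len(f1s)
    oa = sum(conf[i][i] for i in range(n)) / total        # overall pixel accuracy
    return miou, mf1, oa


if __name__ == "__main__":
    # Toy two-class example with made-up counts, not results from the paper.
    toy = [[50, 10],
           [5, 35]]
    miou, mf1, oa = segmentation_metrics(toy)
    print(f"mIoU={miou:.4f}  mF1={mf1:.4f}  OA={oa:.4f}")
```

Because OA weights every pixel equally, it tends to be dominated by large classes, whereas mIoU and mF1 average over classes and are therefore more sensitive to small-object performance, which is the regime the abstract emphasizes.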
Wang et al. (Thu,) studied this question.