Remote sensing semantic segmentation is fundamental to applications such as land-cover mapping, urban analysis, and environmental monitoring. However, remote sensing scenes often exhibit pronounced scale variation, fragmented regions, dense small objects, and complex boundary transitions, which makes fine-grained prediction particularly challenging. Transformer-based architectures such as SegFormer model long-range context effectively through hierarchical encoding, yet their lightweight decoders rely mainly on linear projection and feature fusion and therefore offer limited capacity for local refinement after multi-scale aggregation. This limitation may reduce spatial precision in boundary-sensitive and small-object-rich regions. To address this issue, we propose the Post-fusion Enhanced Block (PFEB), a lightweight decoder-side refinement module inserted after multi-scale feature fusion and before pixel-wise classification. PFEB combines channel expansion, depthwise and pointwise convolutions, efficient channel attention (ECA), and residual learning to enhance local semantic refinement while largely preserving computational efficiency. Built upon SegFormer, the proposed method is evaluated on two widely used remote sensing benchmarks, LoveDA and ISPRS Vaihingen, with both Mix Transformer-B0 (MiT-B0) and Mix Transformer-B2 (MiT-B2) backbones. Experimental results show that PFEB consistently improves the SegFormer baseline across datasets and model scales. With the MiT-B2 backbone, our method achieves 53.82 ± 0.31 mean intersection over union (mIoU) on LoveDA and 74.84 ± 0.41 mIoU on ISPRS Vaihingen. Boundary- and size-aware evaluations further indicate that the gains come mainly from improved semantic correctness near boundaries and better recovery of small objects. At a modest additional cost (approximately +0.53 M parameters and +8.7 G floating point operations (FLOPs)), PFEB provides a favorable accuracy–efficiency trade-off. These results suggest that PFEB is an effective and lightweight post-fusion refinement module for fine-grained remote sensing semantic segmentation.
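To make the described composition concrete, the following is a minimal PyTorch sketch of a post-fusion refinement block built from the components named above: channel expansion, a depthwise convolution, ECA, a pointwise projection, and a residual connection. The abstract does not specify the ordering of these operations, the expansion ratio, kernel sizes, normalization, or activation, so all of those choices are assumptions here, and the class names `ECA` and `PFEBSketch` are hypothetical. It should not be read as the authors' implementation.

```python
import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient channel attention: global pooling, 1D conv over channels, sigmoid gate."""

    def __init__(self, kernel_size: int = 3):  # kernel size is an assumption
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, C, 1, 1) -> (B, 1, C): channels become the 1D sequence
        w = self.pool(x).squeeze(-1).transpose(1, 2)
        w = torch.sigmoid(self.conv(w))
        # (B, 1, C) -> (B, C, 1, 1) so the gate broadcasts over H and W
        w = w.transpose(1, 2).unsqueeze(-1)
        return x * w


class PFEBSketch(nn.Module):
    """Hypothetical post-fusion refinement block (ordering and sizes are assumed)."""

    def __init__(self, channels: int = 256, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        # Channel expansion with a pointwise (1x1) convolution
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
        )
        # Depthwise 3x3 convolution: one spatial filter per channel (groups=hidden)
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
        )
        self.eca = ECA()
        # Pointwise projection back to the fused feature width
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.expand(x)
        out = self.depthwise(out)
        out = self.eca(out)
        out = self.project(out)
        # Residual connection: the block learns a correction to the fused features
        return x + out
```

In a SegFormer-style decoder, such a block would sit between the multi-scale fusion layer and the final classification convolution, for example `logits = classifier(PFEBSketch(channels=C)(fused))`, where `fused`, `classifier`, and `C` are placeholder names. Because the input and output shapes match, the block can be added or removed without changing the rest of the decoder.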
Lian et al. (Mon,) studied this question.