High-resolution land cover semantic segmentation is challenged by strong class imbalance, spatial fragmentation of minority classes, and the presence of fine-scale textures and sensor noise that can dominate early feature learning. In addition, producing high-resolution labeled maps is time-consuming and requires expert annotation, while low-resolution maps are easier to obtain but lack spatial precision. To address these challenges, we propose MUSCLE-Net, a Multi-Scale Land Cover Network that explicitly enforces semantic consistency across spatial resolutions through deep supervision. By introducing an auxiliary low-resolution segmentation task during early decoding, the network is constrained to learn semantically meaningful regional representations before recovering fine spatial details, promoting a coarse-to-fine decoding process that mitigates overfitting to high-frequency noise. Convolutional Block Attention Modules are incorporated in the decoder to further refine spatial and channel-wise feature selection. For the DynamicEarthNet dataset, MUSCLE-Net achieves an overall accuracy of 66.48%, outperforming UNet by 1.05%, DeepLabV3 by 6.21%, and PSPNet by 7.70%. For the DFC2020 dataset, MUSCLE-Net reaches an overall accuracy of 70.10%, improving upon UNet by 2.86%, PSPNet by 4.98%, and DeepLabV3 by 6.10%, and consistently shows lower variability between runs, reflecting enhanced robustness in minority land-cover classes.

• Introduces MUSCLE-Net with multi-resolution deep supervision in the decoder.
• Enforces semantic consistency across scales to improve minority class learning.
• Proposes an attention-enhanced decoder using CBAM without residual addition.
• Re-injects auxiliary supervision features to guide early decoder representations.
• Demonstrates consistent accuracy and stability gains on two benchmark datasets.
Mobsite et al. studied this question.
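The multi-resolution deep supervision described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the low-resolution targets are obtained, how the two losses are combined, or what weighting is used, so everything below is an assumption made for illustration: coarse labels are derived by majority vote over blocks of the high-resolution map, both terms are plain pixel-wise cross-entropy, and the auxiliary term is scaled by a hypothetical `aux_weight`. The function names (`downsample_labels`, `cross_entropy`, `deep_supervision_loss`) are likewise hypothetical and are not taken from MUSCLE-Net.

```python
import math
from collections import Counter


def downsample_labels(labels, factor):
    """Assumed target construction: one coarse label per factor x factor block,
    chosen by majority vote over the high-resolution labels in that block."""
    rows, cols = len(labels), len(labels[0])
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [labels[i][j]
                     for i in range(r, min(r + factor, rows))
                     for j in range(c, min(c + factor, cols))]
            counts = Counter(block)
            best = max(counts.values())
            # Ties are broken in favour of the smallest class index.
            coarse_row.append(min(k for k, v in counts.items() if v == best))
        coarse.append(coarse_row)
    return coarse


def cross_entropy(probs, labels):
    """Mean pixel-wise cross-entropy; probs[i][j] is a list of class probabilities."""
    total, n = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for p, y in zip(prob_row, label_row):
            total -= math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
            n += 1
    return total / n


def deep_supervision_loss(probs_high, probs_low, labels_high, factor, aux_weight=0.5):
    """Main high-resolution loss plus a weighted auxiliary low-resolution loss.
    The additive form and the default aux_weight are assumptions."""
    labels_low = downsample_labels(labels_high, factor)
    loss_high = cross_entropy(probs_high, labels_high)
    loss_low = cross_entropy(probs_low, labels_low)
    return loss_high + aux_weight * loss_low


# Toy 4x4 label map with three classes, pooled to 2x2.
labels = [[0, 0, 1, 1],
          [0, 2, 1, 1],
          [2, 2, 0, 1],
          [2, 2, 1, 1]]
coarse = downsample_labels(labels, 2)  # [[0, 1], [2, 1]]
```

In this toy map the isolated class-0 pixel in the lower-right block is absorbed into class 1 at the coarse scale, which is the kind of regional target an auxiliary low-resolution head would be trained against before fine details are recovered. In a real training setup the probabilities would come from the early decoder stage and the final output of the network, and the loss would be computed with a tensor library rather than nested lists.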