Precise semantic segmentation of High-Resolution Remote Sensing (HRRS) images is essential for robust environmental surveillance and detailed land use mapping. Despite substantial advances in deep learning, most conventional approaches operate only in the spatial domain. They therefore neglect the textural and structural cues available in the frequency domain, which limits how completely the image content is represented. To address this issue, we introduce SF-Net, a network that integrates features from the spatial and frequency domains. At its core, SF-Net employs a multiscale Convolutional Grouping Fusion Module (CGFM) to extract spatial features at varying resolutions. The Haar Wavelet Transform then decomposes these features into low-frequency components (structure) and high-frequency components (detail). Subsequently, a Mamba-enhanced Global Spatial Feature Extraction Module (GSFEM) reinforces low-frequency semantic information with global context, while a Spatial-Frequency Fusion Module (S-FFM) applies targeted attention to sharpen high-frequency details. Experiments on the ISPRS Vaihingen, LoveDA, and Potsdam benchmarks show that SF-Net achieves state-of-the-art mean Intersection over Union (mIoU) scores of 83.12%, 53.28%, and 83.35%, respectively, validating its effectiveness.
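To make the low-frequency/high-frequency split concrete, the following is a minimal sketch of a single-level 2D Haar decomposition on one feature channel. It is not the SF-Net implementation: the function name `haar_dwt2`, the use of plain Python lists, the sub-band labels, and the orthonormal 1/2 scaling are all assumptions chosen for illustration, and the abstract does not specify how many decomposition levels or which normalization the network uses.

```python
def haar_dwt2(x):
    """Single-level 2D Haar transform of a 2D list (hypothetical helper).

    Assumes even height and width. Returns four half-resolution sub-bands:
    one low-frequency approximation and three high-frequency detail bands.
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")

    approx, row_diff, col_diff, diag = [], [], [], []
    for i in range(0, h, 2):
        r_a, r_r, r_c, r_d = [], [], [], []
        for j in range(0, w, 2):
            # The four pixels of one non-overlapping 2x2 block.
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r_a.append((a + b + c + d) / 2)  # local average: structure
            r_r.append((a + b - c - d) / 2)  # top row minus bottom row
            r_c.append((a - b + c - d) / 2)  # left column minus right column
            r_d.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(r_a)
        row_diff.append(r_r)
        col_diff.append(r_c)
        diag.append(r_d)
    return approx, row_diff, col_diff, diag


# A 2x2 patch with a vertical edge: only the column-difference band responds.
low, rows, cols, diagonal = haar_dwt2([[1, 3], [1, 3]])
print(low, rows, cols, diagonal)
```

In this sketch the approximation band carries the coarse structure that a global-context module such as GSFEM would operate on, while the three difference bands carry the edge and texture detail that S-FFM is described as sharpening.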
Ge et al. (Thu,) studied this question.