Transformer-based deep learning techniques have recently shown strong potential in remote sensing scene classification (RSSC), owing to their ability to capture global semantic relationships and contextual dependencies. However, exploiting raw-image and global semantic information while also accounting for fine-grained features and multi-scale spatial relationships remains a major challenge. This paper therefore proposes FG-Swin KANsformer, a novel model that combines frequency-domain and gradient prior information from raw images with the Kolmogorov–Arnold Network (KAN) to enhance nonlinear feature modeling. FG-Swin KANsformer consists of three key components: the Discrete Cosine Transform (DCT) module, the gradient-spatial feature extraction (GSFE) module, and the Swin Transformer module integrated with KAN. In the feature embedding phase, the DCT module extracts frequency-domain features, while the GSFE module uses multi-scale convolutions and Sobel operators to extract spatial structures and gradient information at different scales, thereby making fuller use of the original image’s frequency-domain and gradient prior information. In the Swin Transformer feature modeling phase, the conventional multilayer perceptron (MLP) in each Swin Transformer Block is replaced by KAN, which decomposes complex multivariate functions into compositions of univariate functions, thereby improving nonlinear representation capacity and feature discrimination. Extensive experiments on three public remote sensing (RS) datasets demonstrate that FG-Swin KANsformer achieves outstanding classification performance.
Zhu et al. studied this question.
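The three building blocks named in the abstract (DCT features, Sobel gradients, and a KAN-style layer built from univariate functions) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`dct2_2d`, `sobel_magnitude`, `kan_layer`) are hypothetical, the multi-scale convolutions and the Swin Transformer itself are omitted, and the KAN layer uses fixed Gaussian basis functions as a stand-in for whatever learnable univariate parameterization (for example, B-splines) the paper actually adopts.

```python
import math

def dct2_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct2_2d(img):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct2_1d(r) for r in img]
    cols = [dct2_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def correlate_valid(img, kernel):
    """2-D cross-correlation without padding ('valid' output size)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    return [[sum(kernel[a][b] * img[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]

def sobel_magnitude(img):
    """Gradient magnitude sqrt(gx^2 + gy^2) from the two Sobel responses."""
    gx = correlate_valid(img, SOBEL_X)
    gy = correlate_valid(img, SOBEL_Y)
    return [[math.hypot(a, b) for a, b in zip(rx, ry)]
            for rx, ry in zip(gx, gy)]

def kan_layer(x, coeffs, centers, width=1.0):
    """KAN-style layer: y_j = sum_i phi_ji(x_i).

    Assumption: each univariate phi_ji is a weighted sum of Gaussian
    bumps, phi_ji(t) = sum_k coeffs[j][i][k] * exp(-((t - centers[k]) / width)**2).
    """
    feats = [[math.exp(-((t - c) / width) ** 2) for c in centers] for t in x]
    return [sum(c * b
                for ci, bi in zip(cj, feats)
                for c, b in zip(ci, bi))
            for cj in coeffs]
```

As a usage example, applying `sobel_magnitude` to a small image with a single vertical edge yields a uniform nonzero response across the edge, while `dct2_2d` on a constant image concentrates all energy in the DC coefficient. In a model of the kind described, such frequency and gradient maps would presumably be fused with the patch embeddings, and `kan_layer` would take the place of the MLP inside each block.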