Vision Transformers (ViTs) have been widely adopted in hyperspectral image unmixing (HU) owing to their strong capability in modeling global contextual dependencies. However, standard ViTs partition images into non-overlapping patches, which disrupts pixel-level spatial continuity and neglects the fine-grained structural relationships among pixels within local regions. Consequently, capturing the detailed spatial–spectral features required for accurate unmixing remains challenging. Moreover, the high computational complexity of global self-attention and its sensitivity to noise limit the applicability of conventional Transformers to HU. To address these issues, we propose a spatial–spectral similarity guided Transformer-in-Transformer (SSTNT) framework. The network adopts a modified Transformer-in-Transformer (TNT) architecture: the inner Transformer employs a linear self-attention (LSA) mechanism to efficiently exploit pixel-level local features within sliding windows, while the outer Transformer retains global attention to aggregate contextual information, forming a cooperative local–global optimization scheme. In addition, a lightweight spatial–spectral similarity module is introduced to enhance the modeling of neighborhood structures. Finally, spectral reconstruction is achieved through a trainable endmember decoder and a normalized abundance estimation module. Extensive experiments on both synthetic and real hyperspectral datasets demonstrate the effectiveness and robustness of the proposed method.
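To illustrate why a linear self-attention mechanism can be cheaper than standard attention within a window, the following minimal sketch uses a generic kernel feature-map formulation. This is an assumption for illustration only: the abstract does not specify the exact form of LSA, and the feature map (`elu(x) + 1`), the function names, and the toy inputs are all hypothetical. The key idea is that the key–value summary is computed once and reused for every query, so the cost grows linearly with the number of pixels in the window rather than quadratically.

```python
import math

def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed choice, not from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """Kernel-based linear attention over one window of tokens (pixels).

    Builds the d x d_v summary sum_j phi(k_j) v_j^T and the vector
    sum_j phi(k_j) once, then reuses both for every query.
    """
    d, dv = len(keys[0]), len(values[0])
    kv = [[0.0] * dv for _ in range(d)]
    k_sum = [0.0] * d
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(dv):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = feature_map(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(dv)])
    return out

def quadratic_reference(queries, keys, values):
    """Same attention computed pairwise, used only to check the result."""
    dv = len(values[0])
    out = []
    for q in queries:
        fq = feature_map(q)
        w = [sum(x * y for x, y in zip(fq, feature_map(k))) for k in keys]
        total = sum(w)
        out.append([sum(wi * v[b] for wi, v in zip(w, values)) / total
                    for b in range(dv)])
    return out

# Hypothetical toy window: 4 pixels, 3-dimensional embeddings.
tokens = [[0.2, -0.5, 1.0], [1.5, 0.3, -0.7], [-1.0, 0.8, 0.1], [0.0, 0.4, 0.9]]
fast = linear_attention(tokens, tokens, tokens)
slow = quadratic_reference(tokens, tokens, tokens)
```

Both functions produce the same output up to floating-point error; the linear version simply reorders the matrix products so that no pairwise pixel-to-pixel weight matrix is ever formed.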
Cui et al. studied this question.
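The reconstruction step described in the abstract can be sketched as follows, under two assumptions that the abstract does not confirm: that abundances are normalized with a softmax (which enforces nonnegativity and sum-to-one), and that the endmember decoder follows the linear mixing model, with the endmember spectra acting as its trainable weights. All names and numeric values are hypothetical.

```python
import math

def normalize_abundances(scores):
    """Softmax: maps raw scores to nonnegative abundances that sum to one.

    Softmax is an assumed normalization; the paper may use another scheme.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decode(abundances, endmembers):
    """Linear mixing model: reconstructed spectrum = sum_k a_k * e_k.

    `endmembers` holds K spectra of B bands each; in a trainable decoder
    these values would be the learnable weights of a bias-free linear layer.
    """
    bands = len(endmembers[0])
    return [sum(a * e[b] for a, e in zip(abundances, endmembers))
            for b in range(bands)]

# Hypothetical example: 3 endmembers, 4 spectral bands.
endmembers = [
    [0.1, 0.2, 0.3, 0.4],
    [0.9, 0.8, 0.7, 0.6],
    [0.5, 0.5, 0.5, 0.5],
]
scores = [2.0, 0.5, -1.0]  # raw per-pixel network outputs (made up)
abundances = normalize_abundances(scores)
pixel = decode(abundances, endmembers)
```

Because the abundances form a convex combination, each reconstructed band value lies between the smallest and largest endmember value in that band.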