Fusing multi-source remote sensing data has become a key technique for improving the accuracy of ground object classification. Combining hyperspectral images (HSI) with light detection and ranging (LiDAR) data can significantly improve the identification of ground objects in complex environments, but modeling the correlation between their heterogeneous features remains a key challenge. Conventional methods often fuse the two modalities by simple concatenation, which introduces feature redundancy and makes it difficult to exploit their complementary information. To address this issue, this paper proposes a cross-modal cross-attention Transformer network for joint HSI and LiDAR classification. Specifically, the network employs a two-level pyramid architecture to extract multi-scale features at the shallow level, thereby overcoming the redundancy limitations of traditional stacking-based fusion. Furthermore, a cross-attention mechanism is introduced within the Transformer encoder to dynamically capture the semantic correlations between the spectral features of HSI and the elevation information from LiDAR data, enabling feature alignment and enhancement through the adaptive allocation of attention weights. Extensive experiments on three publicly available datasets demonstrate that the proposed method exhibits notable advantages over existing state-of-the-art approaches.
Guo et al. (Fri,) studied this question.
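To make the cross-attention idea concrete, the following is a minimal, dependency-free sketch of single-head scaled dot-product cross-attention in which hyperspectral tokens act as queries and light detection and ranging tokens supply keys and values. It is an illustration of the general technique, not the network described in the paper: the token counts, feature dimensions, random projection matrices, and all function names (`cross_attention`, `random_matrix`, and so on) are assumptions chosen for the example, and a real model would learn the projections and use multiple heads.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned projection weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def cross_attention(hsi_tokens, lidar_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: spectral tokens query elevation tokens.

    Returns (attended, weights), where weights[i][j] is how strongly
    spectral token i attends to elevation token j.
    """
    q = matmul(hsi_tokens, w_q)    # shape (n_hsi, d)
    k = matmul(lidar_tokens, w_k)  # shape (n_lidar, d)
    v = matmul(lidar_tokens, w_v)  # shape (n_lidar, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k, shape (d, n_lidar)
    scores = matmul(q, k_t)               # shape (n_hsi, n_lidar)
    # Scale by sqrt(d) and normalise each row into attention weights.
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)         # shape (n_hsi, d)
    return attended, weights


# Assumed toy sizes: 4 spectral tokens of dimension 6, 3 elevation tokens of dimension 2.
rng = random.Random(0)
hsi_tokens = random_matrix(4, 6, rng)
lidar_tokens = random_matrix(3, 2, rng)
d_model = 4
w_q = random_matrix(6, d_model, rng)
w_k = random_matrix(2, d_model, rng)
w_v = random_matrix(2, d_model, rng)

attended, weights = cross_attention(hsi_tokens, lidar_tokens, w_q, w_k, w_v)
print(len(attended), len(attended[0]))  # number of spectral tokens, model dimension
print([round(sum(row), 6) for row in weights])  # each row of weights sums to 1
```

Unlike simple concatenation, which hands every elevation feature to every spectral feature with equal standing, each row of `weights` is a data-dependent distribution over the elevation tokens, which is what "adaptive allocation of attention weights" refers to in general terms.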