Convolutional neural networks have made significant progress in multimodal remote sensing image classification, but traditional convolutional neural networks are limited by fixed-size convolutional kernels, which are unable to effectively model and adequately extract contextual information; hyperspectral imagery and LiDAR data have comparatively large information differences, which do not allow for effective information interaction and fusion. Based on this, this paper proposes a multimodal dual fusion network (CSTC) based on the Vision Transformer for the collaborative classification of HSI and LiDAR data. The model is designed through a two-branch architecture: the HSI branch extracts spectral–spatial features by dimensionality reduction using principal component analysis and inputs them into the cross-connectivity feature fusion module; the LiDAR branch mines spatial elevation features through the stacked MobileNetV2 module. The features of the two branches are encoded by a Transformer, and the modal interaction fusion is realized by the cross-attention module for the first time. Then, the features are spliced and input into the secondary Transformer for deep cross-modal fusion, and finally, the classification is completed by the multilayer perceptron. Experiments show that the CSTC model achieves overall classification accuracies of 92.32%, 99.81%, 97.90%, and 99.37% on the publicly available MUUFL dataset, Trento dataset, Augsburg dataset, and Houston2013 dataset, respectively, which is superior to the latest HSI–LiDAR separate classification algorithms. The ablation experiments and model performance evaluation experiments further show that the proposed CSTC model achieves excellent results in terms of robustness, adaptability, and parameter scale.
Mei et al. (Thu,) studied this question.