Recently, Vision Transformer (ViT)-based deep learning models have achieved remarkable performance gains in hyperspectral image classification (HSIC) owing to their ability to model long-range dependencies and extract global spatial features. However, ViT is built from a stack of Transformer blocks and must learn a large number of parameters when processing hyperspectral data. In addition, the inherently global correlation modeling of the Transformer neglects the effective representation of local spatial and spectral features. To address these issues, we propose a lightweight ViT network called the Groupwise Separable Convolutional Vision Transformer (GSC-ViT). First, a Groupwise Separable Convolution (GSC) module, which combines grouped pointwise convolution and group convolution, is designed to significantly reduce the number of convolutional kernel parameters while effectively capturing local spectral-spatial information in hyperspectral images. Second, a Groupwise Separable Multi-Head Self-Attention (GSSA) module replaces the conventional Multi-Head Self-Attention (MSA) in ViT; within it, Groupwise Self-Attention (GSA) provides local spatial feature extraction and Pointwise Self-Attention (PWSA) provides global spatial feature extraction. Third, a simple pointwise layer with an enhanced skip connection mechanism replaces the Multi-Layer Perceptron (MLP) layer in all Transformer blocks of ViT, eliminating unnecessary nonlinear transformations and facilitating the fusion of features derived from the GSC and GSSA modules. Extensive experiments on four benchmark hyperspectral datasets show that GSC-ViT achieves strong classification performance with relatively few training samples compared with several existing HSIC approaches. The source code is available at https://github.com/flyzzie/TGRS-GSC-VIT.
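To illustrate why pairing a grouped pointwise convolution with a group convolution reduces kernel parameters, the following is a minimal counting sketch and not the authors' implementation. The helper names `conv_params` and `gsc_params`, the channel counts (200 input bands, 256 output channels), the 3x3 kernel, and the group settings are all assumptions chosen for illustration; the abstract does not state the values used in GSC-ViT.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2-D convolution (bias omitted) with `groups` channel groups.

    Each output channel only sees c_in / groups input channels, so the
    weight tensor has shape (c_out, c_in / groups, kernel, kernel).
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return c_out * (c_in // groups) * kernel * kernel


def gsc_params(c_in, c_out, kernel, pw_groups, conv_groups):
    """Hypothetical GSC-style block: grouped 1x1 conv, then a grouped k x k conv."""
    pointwise = conv_params(c_in, c_out, 1, groups=pw_groups)       # spectral mixing
    spatial = conv_params(c_out, c_out, kernel, groups=conv_groups)  # local spatial filter
    return pointwise + spatial


# Illustrative settings only (assumed, not taken from the paper).
standard = conv_params(200, 256, 3)
separable = gsc_params(200, 256, 3, pw_groups=4, conv_groups=256)

print("standard 3x3 convolution:", standard)
print("grouped pointwise + group convolution:", separable)
print("reduction factor: %.1fx" % (standard / separable))
```

Under these assumed settings the standard convolution needs 460800 weights, while the two-stage grouped variant needs 15104, roughly a 30x reduction. The actual saving in GSC-ViT depends on the group counts and channel widths chosen in the released code.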
Zhao et al. (Mon,) studied this question.