Hyperspectral images carry rich spectral information that enables precise pixel-level classification, and they are therefore widely used in remote sensing applications. Convolutional neural networks (CNNs) are effective for hyperspectral image processing, but their limited receptive fields constrain their capacity to capture long-range dependencies. Transformers excel at modeling long-range features for hyperspectral image classification (HSIC), yet they often represent local spectral–spatial characteristics poorly and incur computational redundancy from numerous classification-irrelevant tokens. To address these challenges, we propose EDTST, a state-of-the-art Vision Transformer architecture designed specifically for efficient hyperspectral image classification. The model first uses a large-kernel 3D convolution block to extract deep spectral–spatial features, which a 2D convolution block then refines. A novel attention mechanism with dynamic token pruning follows; it substantially reduces the computational load by focusing on the most pertinent features. An adaptive average pooling layer and a fully connected layer then perform the classification. Extensive experiments on four standard hyperspectral datasets show that EDTST achieves the highest classification accuracy among the compared models, including a notable 3% improvement in overall accuracy on the WHU-Hi-HanChuan dataset, while requiring the shortest training and inference time among all compared state-of-the-art models from the past three years. These results validate the efficacy of our approach in achieving superior performance with markedly improved computational efficiency.
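The idea of dynamic token pruning mentioned in the abstract can be illustrated with a minimal sketch. This is not EDTST's actual mechanism, which the abstract does not specify. It assumes a generic scheme in which each token is scored by the mean attention it receives from all queries, and only the top-scoring fraction is kept. The names `attention_weights`, `prune_tokens`, and `keep_ratio` are hypothetical, and no learned projections are used.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens):
    """Scaled dot-product attention weights with queries = keys = tokens.

    Assumption: no learned query/key projections, purely for illustration.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    return [
        softmax([scale * sum(q * k for q, k in zip(qi, kj)) for kj in tokens])
        for qi in tokens
    ]


def prune_tokens(tokens, keep_ratio=0.5):
    """Keep the tokens that receive the most attention (hypothetical rule).

    Returns the kept tokens and their original indices, in original order.
    """
    weights = attention_weights(tokens)
    n = len(tokens)
    # Importance of token j = mean attention it receives across all queries.
    importance = [sum(weights[i][j] for i in range(n)) / n for j in range(n)]
    n_keep = max(1, math.ceil(keep_ratio * n))
    ranked = sorted(range(n), key=lambda j: importance[j], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[j] for j in kept], kept


if __name__ == "__main__":
    # Four toy 2-D tokens; a real model would use spectral-spatial embeddings.
    toy = [[5.0, 0.0], [5.0, 0.0], [0.0, 0.0], [5.0, 0.0]]
    pruned, idx = prune_tokens(toy, keep_ratio=0.75)
    print(idx, pruned)
```

Because self-attention compares every token with every other token, its cost grows quadratically with the token count; under this sketch's assumptions, keeping half of the tokens reduces the number of pairwise comparisons in later layers to roughly a quarter.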
Hu et al. (Sun,) studied this question.