Following recent advances in transformer-based architectures for image classification, the original Vision Transformer (ViT) has shown promising results compared with traditional CNN models. Motivated by this, this article reports on the efficacy of transformer-based models for land cover classification of remote sensing images. Our approach applies a variant of the vision transformer, the Swin (Shifted Window) Transformer, a hierarchical transformer that computes its representations with shifted windows. We present an extensive study of its performance on three remote sensing datasets: EuroSAT, NWPU-RESISC45, and AID. The findings indicate that the Swin architecture outperforms current state-of-the-art approaches in classifying remote sensing images. Comparative analyses quantify the margin of improvement and indicate the prospects these transformer architectures hold for image classification tasks of this type.
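The shifted-window idea mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy 8x8 feature map, the window size of 4, and the shift of 2 are all assumptions chosen for illustration. The sketch only shows how a feature map is split into non-overlapping windows and how a cyclic shift changes which pixels share a window; the full Swin model also computes self-attention inside each window and masks attention between regions that the cyclic shift wraps around.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size, window_size, C).
    Assumes H and W are divisible by window_size.
    """
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Bring the two window-grid axes together, then flatten them into one axis.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

def shifted_window_partition(x, window_size, shift):
    """Cyclically shift the map up and left by `shift`, then partition it."""
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window_size)

# Hypothetical toy input: an 8x8 single-channel map whose values are flat indices.
feature_map = np.arange(64).reshape(8, 8, 1)
regular = window_partition(feature_map, window_size=4)
shifted = shifted_window_partition(feature_map, window_size=4, shift=2)

print(regular.shape)        # four 4x4 windows
print(regular[0, :, :, 0])  # rows 0-3, columns 0-3 of the original map
print(shifted[0, :, :, 0])  # rows 2-5, columns 2-5 of the original map
```

In this toy case the first shifted window straddles all four regular windows, which is how alternating the two partitions lets information cross window boundaries between successive layers.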
Jannat et al. studied this question.