What question did this study set out to answer?

This research aims to evaluate the effectiveness of CNNs and Transformers in classifying satellite images.

March 25, 2026 · Open Access

Deep learning paradigms in remote sensing with a comparative evaluation of CNNs and transformers for satellite imagery

Key Points

This research aims to evaluate the effectiveness of CNNs and Transformers in classifying satellite images.
Compared six CNN models and one Transformer model
Used UC Merced Land Use dataset for evaluation
Trained models under identical conditions using NVIDIA A100 GPU
Standardized preprocessing applied to the dataset
Analyzed accuracy and computational efficiency trade-offs
ResNet101 achieved 96.33% accuracy with 1m18s training time
Google ViT reached 98.57% accuracy with 9m28s training time
ResNet101 performed competitively while using less than 1/7th the training time of Google ViT

Abstract

Satellite image classification is a remote sensing process that assigns the pixels of satellite imagery to land-use and land-cover categories, turning raw imagery into usable information. This information is not limited to a few land-cover types and specific features; it is also an essential tool for pattern recognition and for recording land information. Convolutional neural networks (CNNs) have traditionally been popular in this domain because of their feature-extraction capabilities on grid-like data. Recent advances in Transformer models have introduced a new way of performing such tasks and working with labelled data. This study compares six CNN models (CNN, InceptionV3, VGG16, Xception, ResNet50, ResNet101) and one Transformer model (Google ViT-Base-Patch16-224-in21k) on the UC Merced Land Use dataset. All models were trained and evaluated under identical computational conditions, using a single NVIDIA A100 GPU, fixed hyperparameters, and standardized pre-processing of the same dataset. The objective was to assess how effectively each architecture trades off accuracy against computational efficiency. Evaluation runs showed that ResNet101 (96.33% accuracy with 1 m 18 s of training time) performed competitively with Google ViT (98.57% accuracy with 9 m 28 s of training time) while taking less than 1/7th of the Transformer’s training time.
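
The trade-off reported in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code: the helper name to_seconds and the dictionary layout are assumptions, and the only data used are the two accuracy and training-time figures quoted above.

```python
def to_seconds(duration: str) -> int:
    """Convert a duration written like '1m18s' into whole seconds.

    Hypothetical helper for this example; it only handles the
    '<minutes>m<seconds>s' form used in the reported results.
    """
    minutes, _, rest = duration.partition("m")
    return int(minutes) * 60 + int(rest.rstrip("s"))


# Figures exactly as reported in the abstract.
reported = {
    "ResNet101": {"accuracy": 96.33, "train_time": "1m18s"},
    "Google ViT": {"accuracy": 98.57, "train_time": "9m28s"},
}

resnet_s = to_seconds(reported["ResNet101"]["train_time"])
vit_s = to_seconds(reported["Google ViT"]["train_time"])

# Fraction of the Transformer's training time that ResNet101 needed.
time_ratio = resnet_s / vit_s

# Accuracy difference in percentage points (ViT minus ResNet101).
accuracy_gap = reported["Google ViT"]["accuracy"] - reported["ResNet101"]["accuracy"]

print(f"ResNet101: {resnet_s} s, Google ViT: {vit_s} s")
print(f"time ratio: {time_ratio:.3f} (1/7 is {1 / 7:.3f})")
print(f"accuracy gap: {accuracy_gap:.2f} percentage points")
```

With these figures, ResNet101 trains in 78 s against 568 s for Google ViT, a ratio of about 0.137, which is below 1/7 (about 0.143). The cost of that saving is an accuracy gap of about 2.24 percentage points.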
