Satellite image classification is a remote sensing task that assigns the pixels of satellite imagery to land use and land cover classes. The resulting information is not limited to distinguishing land cover types and specific features; it is also an essential tool for pattern recognition and for storing land information. Convolutional neural networks (CNNs) have traditionally been popular in this domain because of their feature-extraction capabilities on grid-like data. Recent advances in transformer models have introduced a new way of performing such tasks on labelled data. This study compares six CNN models (CNN, InceptionV3, VGG16, Xception, ResNet50, ResNet101) and one Transformer model (Google ViT-Base-Patch16-224-in21k) on the UC Merced Land Use dataset. All models were trained and evaluated under identical computational conditions, using a single NVIDIA A100 GPU, fixed hyperparameters, and standardized pre-processing of the same dataset. The objective was to determine how effective each architecture is when accuracy is weighed against computational efficiency. Evaluation runs showed that ResNet101 (96.33% accuracy with 1 m 18 s of training time) was competitive with Google ViT (98.57% accuracy with 9 m 28 s of training time) while taking less than one-seventh of the Transformer’s training time.
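The trade-off between ResNet101 and Google ViT reported above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the accuracy and training-time figures quoted in this abstract; the variable names are illustrative and do not come from the study's code.

```python
# Figures quoted in the abstract; variable names are illustrative only.
resnet101_acc = 96.33            # accuracy in percent
resnet101_secs = 1 * 60 + 18     # 1 m 18 s of training time
vit_acc = 98.57                  # accuracy in percent
vit_secs = 9 * 60 + 28           # 9 m 28 s of training time

# Fraction of the Transformer's training time that ResNet101 needs.
time_ratio = resnet101_secs / vit_secs

# Accuracy gap in percentage points (ViT minus ResNet101).
accuracy_gap = vit_acc - resnet101_acc

print(f"ResNet101 trains in {time_ratio:.3f} of ViT's time")
print(f"ViT leads by {accuracy_gap:.2f} percentage points")
```

With these figures the time ratio falls below 1/7, which matches the claim in the abstract, while the accuracy difference stays a little over two percentage points.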
Nigam et al. (Mon,) studied this question.