August 15, 2025 | Open Access

Are vision transformers replacing convolutional neural networks in scene interpretation?: A review

Key Points

Vision transformers match or outperform convolutional neural networks on several benchmark scene interpretation tasks, pointing to a potential shift in methodology.
The review analyzed 142 peer-reviewed studies from 2017 to 2024, comparing the efficacy of CNNs and vision transformers on multiple public datasets.
A comprehensive assessment of architectural foundations, training strategies, and performance metrics for CNN and ViT models was conducted.
Future research directions are explored, indicating opportunities for advancing vision transformer model designs in scene recognition tasks.

Abstract

Visual scene interpretation is the significant and challenging process of observing, exploring, and describing dynamic scenes. It supports reliable and safe interaction with the natural world and the surrounding environment. Computer vision plays a key role by enabling machines to understand visual scenes in much the same way people do. Advances in the field have been highly successful, driven primarily by deep learning algorithms. Recently, Vision Transformers (ViTs) have emerged as a viable alternative to conventional Convolutional Neural Networks (CNNs). Powered by an attention mechanism, ViT-based approaches have demonstrated competitive or superior performance to CNNs on several benchmark scene interpretation tasks. This review presents a comprehensive and methodical analysis of recent developments in CNN- and ViT-based models for scene recognition. A total of 142 peer-reviewed studies published between 2017 and 2024 were reviewed based on defined inclusion criteria, focusing on works that evaluate these models on public datasets. The review begins with an overview of the architectural foundations and functional variations of CNNs used for scene interpretation. Next, it explores the structure of ViTs, including their multi-head self-attention mechanisms, and assesses state-of-the-art ViT variants with respect to design innovations, training strategies, and performance metrics. Finally, it discusses possible future research directions for designing ViT models. The study can thus serve as a reference for scholars and experts developing new ViT architectures in this domain.
