December 4, 2025 | Open Access

SiWformer: multi-scale feature fusion via discrete wavelet transform and vision transformer

Key Points

Fusing multi-scale features from the Discrete Wavelet Transform with transformer self-attention improves object and scene classification accuracy.
Experiments on UIUC Sports and PASCAL VOC 2012 show improved robustness with a scene classification module built on these multi-scale features.
The feature extraction stage preserves spatial relationships while capturing features across object scales and resolutions.
Improved scene classification in complex environments supports advanced object recognition applications.

Abstract

This paper presents a novel hierarchical feature fusion framework for scale-invariant multi-object classification in complex scene recognition. Traditional deep learning models struggle to capture multi-scale features effectively, which limits their ability to classify objects of varying size and resolution. To address this, we introduce Multi-Scale Feature Fusion via Discrete Wavelet Transform and Vision Transformer (SiWformer), which integrates the Discrete Wavelet Transform (DWT) with a transformer-based self-attention mechanism to extract both fine-grained and global image representations. The Multi-Scale Feature Extraction (MSFE) module decomposes images into multiple frequency bands, enhancing feature diversity and preserving spatial relationships across resolutions. A transformer-based fusion mechanism then aligns and refines these features to produce a comprehensive representation. For scene classification, a Maximum Entropy-based Scene Classification module leverages object co-occurrence relationships to enhance contextual understanding. Extensive experiments on the benchmark datasets UIUC Sports and PASCAL VOC 2012 demonstrate that SiWformer significantly enhances both object and scene classification performance, achieving competitive accuracy and improved robustness over existing methods. These results validate the effectiveness of the proposed feature fusion strategy for fine-grained and context-aware scene understanding.
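
To make the idea of decomposing an image into multiple frequency bands concrete, the following is a minimal sketch of a multi-level 2D Haar wavelet decomposition in NumPy. It is not the authors' MSFE module: the abstract does not state the wavelet family, the number of levels, or how the bands are fed to the transformer, so the Haar filter, the `levels` parameter, and the function names `haar_dwt2` and `multiscale_bands` are all assumptions made for illustration.

```python
import numpy as np


def haar_dwt2(image):
    """One level of an orthonormal 2D Haar DWT (illustrative choice of wavelet).

    Returns the low-frequency approximation and three detail sub-bands,
    each with half the height and width of the input.
    """
    x = np.asarray(image, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")

    s = np.sqrt(2.0)
    # Pairwise sums/differences of adjacent columns (filtering along each row).
    lo = (x[:, 0::2] + x[:, 1::2]) / s
    hi = (x[:, 0::2] - x[:, 1::2]) / s
    # Pairwise sums/differences of adjacent rows (filtering along each column).
    approx = (lo[0::2, :] + lo[1::2, :]) / s
    detail_a = (lo[0::2, :] - lo[1::2, :]) / s  # low-pass columns, high-pass rows
    detail_b = (hi[0::2, :] + hi[1::2, :]) / s  # high-pass columns, low-pass rows
    detail_c = (hi[0::2, :] - hi[1::2, :]) / s  # high-pass in both directions
    return approx, detail_a, detail_b, detail_c


def multiscale_bands(image, levels=2):
    """Apply haar_dwt2 repeatedly to the approximation band.

    Returns the coarsest approximation and a list of detail-band triples,
    ordered from the finest scale to the coarsest.
    """
    current = np.asarray(image, dtype=np.float64)
    details = []
    for _ in range(levels):
        current, d_a, d_b, d_c = haar_dwt2(current)
        details.append((d_a, d_b, d_c))
    return current, details


if __name__ == "__main__":
    img = np.arange(64, dtype=np.float64).reshape(8, 8)
    approx, details = multiscale_bands(img, levels=2)
    print(approx.shape, [d[0].shape for d in details])
```

Each level halves the spatial resolution, so the detail triples form a pyramid of band-limited views of the same image. In a pipeline like the one described, such sub-bands would be the candidates for patch embedding and attention-based fusion; how SiWformer does this is not specified in the abstract.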
