Abstract

This paper presents a novel hierarchical feature fusion framework for scale-invariant multi-object classification in complex scene recognition. Traditional deep learning models struggle to capture multi-scale features, which limits their ability to classify objects under varying size and resolution conditions. To address this, we introduce the Multi-Scale Feature Fusion via Discrete Wavelet Transform and Vision Transformer (SiWformer), which integrates the Discrete Wavelet Transform (DWT) with a transformer-based self-attention mechanism to extract both fine-grained and global image representations. The Multi-Scale Feature Extraction (MSFE) module decomposes images into multiple frequency bands, enhancing feature diversity and preserving spatial relationships across different resolutions. A transformer-based fusion mechanism then aligns and refines these features to produce a unified representation. For scene classification, a Maximum Entropy-based Scene Classification module leverages object co-occurrence relationships to enhance contextual understanding. Extensive experiments on the benchmark datasets UIUC Sports and PASCAL VOC 2012 demonstrate that Wavelet-ViT significantly enhances both object and scene classification performance, achieving competitive accuracy and improved robustness over existing methods. These results validate the effectiveness of the proposed feature fusion strategy for fine-grained and context-aware scene understanding.
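
To make the idea of decomposing an image into multiple frequency bands concrete, the following is a minimal sketch of a multi-level 2D Haar wavelet decomposition. It is an illustration only and not the MSFE module itself: the choice of the Haar wavelet, the number of levels, and the function names `haar_dwt2` and `multiscale_bands` are all assumptions made for this example, and the code uses plain Python lists so that it runs without any dependencies.

```python
import math


def haar_dwt2(image):
    """Single-level 2D Haar DWT of a list-of-lists image (illustrative helper).

    Returns four half-resolution sub-bands (ll, lh, hl, hh). The first letter
    refers to the filter applied along each row, the second to the filter
    applied down each column. Sub-band naming conventions vary between
    libraries.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("expected an image with even height and width")
    s = math.sqrt(2.0)

    # Pass 1: filter each row over pairs of horizontally adjacent pixels.
    lo = [[(r[j] + r[j + 1]) / s for j in range(0, cols, 2)] for r in image]
    hi = [[(r[j] - r[j + 1]) / s for j in range(0, cols, 2)] for r in image]

    def vertical(band, sign):
        # Pass 2: combine pairs of vertically adjacent rows (sum or difference).
        width = len(band[0])
        return [
            [(band[i][j] + sign * band[i + 1][j]) / s for j in range(width)]
            for i in range(0, rows, 2)
        ]

    ll = vertical(lo, +1)  # low-pass in both directions: coarse approximation
    lh = vertical(lo, -1)  # detail along the vertical direction
    hl = vertical(hi, +1)  # detail along the horizontal direction
    hh = vertical(hi, -1)  # detail in both directions
    return ll, lh, hl, hh


def multiscale_bands(image, levels=2):
    """Recursively decompose the approximation band (assumed scheme).

    Returns the coarsest approximation and a list of (lh, hl, hh) detail
    tuples, ordered from the finest scale to the coarsest.
    """
    details = []
    current = image
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(current)
        details.append((lh, hl, hh))
        current = ll
    return current, details
```

Calling `multiscale_bands(img, levels=2)` on an 8x8 image yields a 2x2 approximation plus detail bands at 4x4 and 2x2 resolution. In a pipeline like the one described, each band would presumably be embedded and passed to the transformer-based fusion stage, but that step is not sketched here.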
Wu et al. (Mon,) studied this question.