Although convolutional neural networks (CNNs) have achieved remarkable success in image classification, their fixed receptive fields limit their ability to model long-range semantic dependencies. To address this limitation, we propose the Adaptive Multi-Scale Residual Attention Network (AMSRA-Net), an architecture that integrates multi-scale local features with global self-attention. AMSRA-Net consists of four cascaded hierarchical multimodal residual attention blocks (HMRABs), each comprising a multi-scale feature decoupler (MSFD) and a lightweight gated self-attention engine (LGSA-Engine). The MSFD uses a channel-splitting strategy to extract features at different granularities in parallel. Building on these features, the LGSA-Engine establishes long-range dependencies across spatial locations via nonlinear transformations, dynamically suppressing redundant background information while enhancing critical semantic features. Together, the two components combine cross-scale feature interaction with dynamic feature calibration. Experiments on the CIFAR-10 dataset show that AMSRA-Net achieves a classification accuracy of 95.89%, surpassing baseline models such as ResNet-18 (95.55%) and Compact Convolutional Transformers (CCT, 95.04%) while maintaining lower model complexity. Ablation studies show that accuracy drops significantly, to 89.25% when the gated self-attention engine is removed and to 88.80% when the multi-scale feature decoupler is reduced to a single-scale convolution, supporting the effectiveness of the proposed dual mechanism of “feature decoupling and dynamic fusion.” This study highlights the efficacy of combining self-attention with multi-scale convolutions and offers a new paradigm for integrating CNNs with global attention mechanisms.
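To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch of one block: a channel split with a different spatial scale per group, followed by gated self-attention with a residual connection. It is an illustration under stated assumptions, not the authors' implementation. The function names, the kernel sizes (1, 3, 5, 7), the mean filter standing in for learned convolutions, the random projection weights, and the sigmoid form of the gate are all hypothetical choices made for the example.

```python
import numpy as np

def box_filter(x, k):
    """Mean filter of odd size k over (H, W) with zero padding.
    Assumption: stands in for a learned k x k convolution."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += xp[:, dy:dy + h, dx:dx + w]
    return out / (k * k)

def multi_scale_split(x, kernel_sizes=(1, 3, 5, 7)):
    """Split channels into groups and filter each group at its own scale."""
    groups = np.array_split(x, len(kernel_sizes), axis=0)
    return np.concatenate(
        [box_filter(g, k) for g, k in zip(groups, kernel_sizes)], axis=0
    )

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(x, rng):
    """Self-attention over all spatial positions, scaled by a sigmoid gate.
    Assumption: weights are random here; in a real model they are learned."""
    c, h, w = x.shape
    tokens = x.reshape(c, h * w).T                      # (N, C), N = H * W
    wq, wk, wv, wg = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(4))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(c))                # (N, N), rows sum to 1
    gate = 1.0 / (1.0 + np.exp(-(tokens @ wg)))         # values in (0, 1)
    out = tokens + gate * (attn @ v)                    # gated residual update
    return out.T.reshape(c, h, w), attn

def block(x, rng):
    """One illustrative block: multi-scale split, then gated attention."""
    return gated_self_attention(multi_scale_split(x), rng)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))   # (channels, height, width)
y, attn = block(x, rng)
```

The block preserves the input shape, so several such blocks could be cascaded as the abstract describes. The attention matrix has one row per spatial position, which is how every location can draw on every other location regardless of distance.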
Jiang et al. (Thu,) studied this question.