May 27, 2026 · Open Access

Integrating Hybrid Attention Mechanisms into CNN-Based Architectures to Enhance Image Classification and Interpretability

Key Points

This research aims to enhance image classification by integrating hybrid attention mechanisms into CNN architectures.
Empirical study on four CNN backbones (ResNet18, VGG16, AlexNet, SqueezeNet) using the CIFAR-10 benchmark.
Hybrid attention module combining SENet and CBAM through adaptive element-wise summation.
Training followed a conservative protocol: 50 epochs, no pretrained weights, and standard augmentation.
ResNet18 accuracy improved from 77.93% to 90.71% (+12.78 percentage points, p<0.001).
VGG16 accuracy improved from 55.78% to 70.17% (+14.39 percentage points, p<0.001).
Parameter overhead is modest at 1.5–5.8%, and training convergence improves by 16.5% on average.
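
The accuracy gains listed above are absolute differences between two accuracy figures, not relative improvements. A minimal Python check, using only the two accuracy pairs given in the key points, makes the distinction concrete. The function name `gains` is illustrative and not taken from the study.

```python
def gains(baseline: float, hybrid: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = hybrid - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


# Accuracy pairs (baseline %, hybrid %) as reported in the key points.
reported = {
    "ResNet18": (77.93, 90.71),
    "VGG16": (55.78, 70.17),
}

for name, (base, hyb) in reported.items():
    abs_points, rel_percent = gains(base, hyb)
    print(f"{name}: +{abs_points:.2f} points absolute, +{rel_percent:.2f}% relative")
```

The absolute figures reproduce the reported +12.78 and +14.39; the relative figures are larger because they are measured against the lower baseline accuracy.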

Abstract

Integrating complementary attention mechanisms into standard Convolutional Neural Networks (CNNs) is a promising strategy for improving feature discrimination without substantial computational overhead. This paper presents a controlled empirical study of a hybrid attention module that combines Squeeze-and-Excitation Networks (SENet) and the Convolutional Block Attention Module (CBAM) through an adaptive element-wise summation with a learnable weighting parameter α and a residual connection. This work contributes a systematic and statistically rigorous evaluation of attention fusion across four CNN backbones (ResNet18, VGG16, AlexNet, and SqueezeNet) on the CIFAR-10 benchmark at 32×32 resolution. All models were trained from scratch under a deliberately conservative protocol (50 epochs, no pretrained weights, standard augmentation) to isolate the incremental effect of attention mechanisms under controlled conditions. Under this protocol, the hybrid SENet+CBAM configuration achieves statistically significant accuracy improvements over the corresponding baselines (p<0.001, 5-fold cross-validation): ResNet18 improves from 77.93% to 90.71% (+12.78 percentage points), VGG16 from 55.78% to 70.17% (+14.39 points), AlexNet from 62.67% to 71.82% (+9.15 points), and SqueezeNet from 71.91% to 78.29% (+6.38 points). These gains must be interpreted within the scope of this controlled setting. Absolute accuracy values are below fully optimized literature benchmarks. For VGG16 in particular, part of the improvement likely reflects correction of underfitting under the conservative protocol, not the full potential of the hybrid mechanism. Parameter overhead remains modest at 1.5–5.8%, and training convergence improves by 16.5% on average. The hybrid approach outperforms the best previously reported SENet+CBAM result for each architecture by an average of 2.32%.
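
The abstract does not state the fusion formula, so the sketch below shows only one plausible reading of "adaptive element-wise summation with a learnable weighting parameter α and a residual connection": the two branch outputs are blended as α·SE(x) + (1 − α)·CBAM(x) and added back to the input x. The convex-combination form, the sigmoid constraint on α, and every name used (`fuse`, `alpha_raw`, the toy branch outputs) are assumptions for illustration, not the authors' implementation.

```python
import math


def fuse(x, se_out, cbam_out, alpha_raw):
    """Blend two attention-branch outputs element-wise and add a residual.

    x, se_out, cbam_out: flat lists of equal length (a flattened feature map).
    alpha_raw: unconstrained scalar; during training this would be learnable.
    """
    if not (len(x) == len(se_out) == len(cbam_out)):
        raise ValueError("all inputs must have the same length")
    # Assumption: squash alpha into (0, 1) so the blend is a convex combination.
    alpha = 1.0 / (1.0 + math.exp(-alpha_raw))
    return [
        xi + alpha * si + (1.0 - alpha) * ci  # residual + weighted element-wise sum
        for xi, si, ci in zip(x, se_out, cbam_out)
    ]


# Toy values standing in for real SENet and CBAM branch outputs.
x = [1.0, 2.0, 3.0]
se_out = [0.5, 1.0, 1.5]
cbam_out = [0.1, 0.2, 0.3]

# alpha_raw = 0 gives alpha = 0.5, an equal blend of the two branches.
print(fuse(x, se_out, cbam_out, alpha_raw=0.0))
```

Under this reading, a large positive `alpha_raw` makes the output lean on the SENet branch and a large negative value on the CBAM branch, while the residual term keeps the input signal intact either way.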
Grad-CAM visualizations and attention entropy analysis provide qualitative evidence of more concentrated spatial attention patterns under the hybrid configuration. These should be understood as proxy indicators rather than rigorous interpretability measures. Validation on more challenging and higher-resolution benchmarks such as CIFAR-100, STL-10, and ImageNet subsets is a necessary next step before broader applicability can be claimed.
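
The abstract does not define its attention entropy measure. A common choice, assumed here, is the Shannon entropy of an attention map normalized to sum to one, so that lower entropy corresponds to attention concentrated on fewer locations. The sketch below uses hypothetical 2×2 maps and an illustrative function name.

```python
import math


def attention_entropy(attn_map):
    """Shannon entropy (in bits) of a non-negative 2D attention map.

    The map is normalized to sum to 1 first; zero entries contribute nothing.
    """
    flat = [v for row in attn_map for v in row]
    if any(v < 0 for v in flat):
        raise ValueError("attention values must be non-negative")
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    probs = [v / total for v in flat]
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


diffuse = [[1.0, 1.0], [1.0, 1.0]]  # attention spread evenly over four cells
focused = [[1.0, 0.0], [0.0, 0.0]]  # all attention on a single cell

print(attention_entropy(diffuse))  # the maximum for four cells, log2(4)
print(attention_entropy(focused))
```

A lower value for the hybrid configuration than for the baseline on the same image would be consistent with the "more concentrated" pattern described above, though it says nothing on its own about whether the attended region is the right one.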
