Integrating complementary attention mechanisms into standard Convolutional Neural Networks (CNNs) is a promising strategy for improving feature discrimination without substantial computational overhead. This paper presents a controlled empirical study of a hybrid attention module that combines Squeeze-and-Excitation Networks (SENet) and the Convolutional Block Attention Module (CBAM) through an adaptive element-wise summation with a learnable weighting parameter α and a residual connection. Its contribution is a systematic, statistically rigorous evaluation of attention fusion across four CNN backbones (ResNet18, VGG16, AlexNet, and SqueezeNet) on the CIFAR-10 benchmark at 32×32 resolution. All models were trained from scratch under a deliberately conservative protocol (50 epochs, no pretrained weights, standard augmentation) to isolate the incremental effect of the attention mechanisms. Under this protocol, the hybrid SENet+CBAM configuration achieves statistically significant accuracy improvements over the corresponding baselines (p < 0.001, 5-fold cross-validation): ResNet18 improves from 77.93% to 90.71% (+12.78 percentage points), VGG16 from 55.78% to 70.17% (+14.39 points), AlexNet from 62.67% to 71.82% (+9.15 points), and SqueezeNet from 71.91% to 78.29% (+6.38 points). These gains must be interpreted within the scope of this controlled setting: absolute accuracy values are below fully optimized literature benchmarks, and for VGG16 in particular, part of the improvement likely reflects correction of underfitting under the conservative protocol rather than the full potential of the hybrid mechanism. Parameter overhead remains modest at 1.5–5.8%, and training convergence improves by 16.5% on average. The hybrid approach outperforms the best previously reported SENet+CBAM result for each architecture by an average of 2.32%.
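The reported gains are absolute differences between accuracies, that is, percentage points rather than relative percentages. The following minimal sketch recomputes them from the figures quoted above; the names `results`, `gains`, `summary`, and `mean_absolute` are illustrative and not taken from the paper, and the relative gain is an added quantity shown only for contrast.

```python
# Baseline and hybrid (SENet+CBAM) test accuracies in percent,
# copied from the figures quoted in the abstract.
results = {
    "ResNet18": (77.93, 90.71),
    "VGG16": (55.78, 70.17),
    "AlexNet": (62.67, 71.82),
    "SqueezeNet": (71.91, 78.29),
}


def gains(baseline, hybrid):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = hybrid - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Per-architecture gains and the unweighted mean of the absolute gains.
summary = {name: gains(b, h) for name, (b, h) in results.items()}
mean_absolute = sum(a for a, _ in summary.values()) / len(summary)

for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute:.2f} points ({relative:.1f}% relative)")
print(f"mean absolute gain: {mean_absolute:.2f} points")
```

Under this reading, the unweighted mean gain across the four backbones is roughly 10.7 percentage points, and the relative gains are larger than the absolute ones because every baseline is below 100%.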
Grad-CAM visualizations and attention entropy analysis provide qualitative evidence of more concentrated spatial attention patterns under the hybrid configuration. These should be understood as proxy indicators rather than rigorous interpretability measures. Validation on more challenging and higher-resolution benchmarks such as CIFAR-100, STL-10, and ImageNet subsets is a necessary next step before broader applicability can be claimed.
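To illustrate why entropy can serve as a proxy for how concentrated an attention map is, the sketch below computes the Shannon entropy of a normalised 2D map. This is an assumption about the general form of such a measure, not the paper's exact definition, which the abstract does not specify; `attention_entropy` and the two toy maps are hypothetical names introduced for illustration.

```python
import math


def attention_entropy(attention_map):
    """Shannon entropy (in nats) of a non-negative 2D map normalised to sum to 1.

    Assumed definition for illustration; lower values mean the attention
    mass is concentrated on fewer spatial locations.
    """
    flat = [value for row in attention_map for value in row]
    if any(value < 0 for value in flat):
        raise ValueError("attention values must be non-negative")
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    probs = [value / total for value in flat]
    # Zero-probability cells contribute nothing (limit of p * log p as p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy 4x4 maps: attention spread evenly versus focused on a single cell.
uniform = [[1.0] * 4 for _ in range(4)]
peaked = [[0.0] * 4 for _ in range(4)]
peaked[1][2] = 1.0

print(attention_entropy(uniform))  # maximal for 16 cells: log(16)
print(attention_entropy(peaked))   # minimal: all mass on one cell
```

A uniform map attains the maximum entropy for its size, while a map with all mass on one location has zero entropy, so a lower value is consistent with, but does not by itself prove, more focused attention.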
Mbayandjambe et al. (Mon,) studied this question.