Classification of artistic styles is gaining importance in computer vision, with applications in the preservation of cultural heritage, the retrieval of works of art, and recommendation systems. While Convolutional Neural Networks (CNNs) excel at local texture analysis, they often fail to capture long-range dependencies; conversely, Transformer-based models capture global context well but are weaker at extracting fine-grained features. We propose ArtFusionNet, a framework that uses an Adaptive Fusion Module (AFM) to combine CNN multiscale feature extraction with Transformer global modeling. The approach uses dilated convolutions and pyramid pooling to extract hierarchical CNN features, which are then tokenized and passed through multi-head self-attention to obtain a global representation. The AFM applies learnable weighting to fuse the two outputs. We evaluated the model on the Fallah_artist dataset, comprising thousands of artworks across multiple styles, as well as on WikiArt, BAM!, and Painting-91, where our model achieved a state-of-the-art accuracy of 99.00%, exceeding stand-alone CNNs, Transformers, and previous hybrid architectures. Our ablation studies showed that the CNN and Transformer components mutually reinforce each other, while a sensitivity analysis identified the hyperparameter values that work best for this model. Statistical significance tests (t-test, ANOVA, p < 0.05) confirm robustness; precision-recall curves and confusion matrices show balanced performance with very few misclassifications. The framework advances artistic style classification by linking local and global feature modeling. Future research will aim to improve the efficiency of the model by compressing it through pruning and knowledge distillation so that it can be deployed in real time.
We will also examine self-supervised learning to improve generalization to a wide range of artistic styles when little labeled data is available. Finally, we plan to incorporate multi-modal frameworks that combine visual and textual information to improve classification accuracy and to support applications such as personalized art recommendation.
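To make the idea of learnable weighting in the AFM more concrete, the following is a minimal, framework-free sketch of one plausible form of such fusion: a convex combination of a CNN feature vector and a Transformer feature vector, with weights obtained by a softmax over two scalar gate parameters. The names `softmax`, `adaptive_fuse`, and `gate_logits` are hypothetical, and the softmax gating is an assumption made for illustration; it is not necessarily the formulation used in ArtFusionNet.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(cnn_feat, trans_feat, gate_logits):
    """Fuse two equal-length feature vectors with softmax-normalized weights.

    gate_logits holds two scalars (hypothetical learnable parameters); in a
    trained model they would be updated by backpropagation. Here they are
    passed in explicitly so the sketch stays self-contained.
    """
    if len(cnn_feat) != len(trans_feat):
        raise ValueError("feature vectors must have the same length")
    w_cnn, w_trans = softmax(gate_logits)
    return [w_cnn * c + w_trans * t for c, t in zip(cnn_feat, trans_feat)]


# Equal gate logits give equal weights, so the fused vector is the
# element-wise mean of the two inputs.
fused = adaptive_fuse([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [0.0, 0.0])
print(fused)
```

Because the weights are normalized to sum to one, the fused features stay on the same scale as the inputs, and a strongly positive first logit makes the output approach the CNN branch alone. In a deep learning framework the two gate values would typically be registered as trainable parameters and could also be made per-channel rather than scalar.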
Liang et al. (Thu,) studied this question.