Recognising emotions in artworks is essential for digital galleries, personalised art recommendations, and art education. However, this task is challenging due to the abstract nature of images and subjective viewer interpretations, and existing methods often inadequately integrate visual content with textual descriptions. To address this issue, this paper proposes a multimodal adaptive emotion recognition network grounded in appraisal theory, featuring a gated adaptive fusion module that dynamically balances image and text contributions. An emotion-aware contrastive learning pre-training strategy is introduced to align cross-modal features. Experiments on the ArtEmis dataset show our method achieves 71.3% accuracy, surpassing state-of-the-art baselines by 2.4 percentage points. Ablation and case studies confirm the effectiveness and interpretability of each component. This work offers a promising solution for emotion understanding in art with demonstrable practical potential in controlled settings.
Xu Dangui (Thu,) studied this question.
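The abstract does not spell out how the gated adaptive fusion module is formulated. As a rough illustration only, the sketch below shows one common form of gated fusion: a sigmoid gate computed from the concatenated image and text features weights the two modalities per dimension. The function name `gated_fusion`, the parameters `weights` and `bias`, and the toy feature vectors are all assumptions made for this example and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(img_feat, txt_feat, weights, bias):
    """Fuse two equal-length feature vectors with a learned per-dimension gate.

    Hypothetical formulation (not confirmed by the paper):
        g_k     = sigmoid(weights[k] . [img; txt] + bias[k])
        fused_k = g_k * img_k + (1 - g_k) * txt_k
    `weights` has one row per output dimension, each of length 2 * d.
    """
    concat = img_feat + txt_feat  # list concatenation: [img; txt]
    fused, gates = [], []
    for k in range(len(img_feat)):
        z = sum(w * x for w, x in zip(weights[k], concat)) + bias[k]
        g = sigmoid(z)
        gates.append(g)
        fused.append(g * img_feat[k] + (1.0 - g) * txt_feat[k])
    return fused, gates


# Toy inputs: with zero weights and zero bias every gate is exactly 0.5,
# so the fused vector is the plain average of the two modalities.
img = [1.0, 0.0]
txt = [0.0, 1.0]
zero_w = [[0.0] * 4 for _ in range(2)]
fused, gates = gated_fusion(img, txt, zero_w, [0.0, 0.0])

# A large positive bias pushes the gate towards 1, favouring the image features.
fused_img, gates_img = gated_fusion(img, txt, zero_w, [10.0, 10.0])
```

In a trained model the gate parameters would be learned, so the balance between image and text could differ for each sample and each feature dimension, which is the behaviour the abstract describes as "dynamically balances".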