Abstract: Emotion analysis is a critical research domain focused on detecting the emotional states of individuals or communities across multiple data modalities, including text, images, and audio. While substantial progress has been made in unimodal (text-based) sentiment analysis, real-world scenarios often involve multimodal data, making integrated approaches essential for capturing contextual richness and improving predictive accuracy. This study introduces a hybrid deep learning model that combines text and visual features through an intermediate fusion mechanism and a multi-task learning framework. Textual inputs are processed using RoBERTa and BiGRU layers, while visual inputs are analyzed through ViT and ResNet50 architectures enhanced by the Convolutional Block Attention Module (CBAM). The fused multimodal representations enable simultaneous and more robust emotion classification. Experimental results on the MVSA dataset show that the proposed model achieves 96.02% accuracy, 95.51% precision, 94.07% recall, and a 94.73% F1-score, outperforming several state-of-the-art multimodal baselines. These findings underscore the model’s methodological contributions and its potential for advancing multimodal emotion analysis in both academic research and real-world applications.
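The intermediate fusion and multi-task idea described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the feature vectors stand in for the outputs of the text branch (RoBERTa and BiGRU) and the visual branch (ViT and ResNet50 with CBAM), and the names `FusionModel`, `multitask_loss`, and `aux_weight`, the layer sizes, and the choice of one main head plus one auxiliary head are all assumptions made for illustration. The abstract does not specify the tasks or the loss weighting.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class FusionModel:
    """Toy intermediate fusion: concatenate modality features, pass them
    through a shared layer, then feed two task-specific heads."""

    def __init__(self, text_dim, image_dim, hidden_dim, n_classes, n_aux_classes, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(rng, hidden_dim, text_dim + image_dim)
        self.main_head = init_layer(rng, n_classes, hidden_dim)
        self.aux_head = init_layer(rng, n_aux_classes, hidden_dim)  # assumed auxiliary task

    def forward(self, text_feat, image_feat):
        fused = text_feat + image_feat  # list concatenation of the two feature vectors
        hidden = [max(0.0, h) for h in linear(fused, *self.shared)]  # shared layer + ReLU
        main_probs = softmax(linear(hidden, *self.main_head))
        aux_probs = softmax(linear(hidden, *self.aux_head))
        return main_probs, aux_probs


def multitask_loss(main_probs, aux_probs, main_label, aux_label, aux_weight=0.5):
    """Weighted sum of per-task cross-entropies (aux_weight is an assumed value)."""
    main_ce = -math.log(main_probs[main_label])
    aux_ce = -math.log(aux_probs[aux_label])
    return main_ce + aux_weight * aux_ce


# Placeholder features standing in for encoder outputs (dimensions are arbitrary).
rng = random.Random(1)
text_feat = [rng.gauss(0, 1) for _ in range(8)]
image_feat = [rng.gauss(0, 1) for _ in range(6)]

model = FusionModel(text_dim=8, image_dim=6, hidden_dim=4, n_classes=3, n_aux_classes=2)
main_probs, aux_probs = model.forward(text_feat, image_feat)
loss = multitask_loss(main_probs, aux_probs, main_label=2, aux_label=0)
print(main_probs, aux_probs, loss)
```

The point of the sketch is structural: both heads read the same fused hidden representation, so gradients from each task would shape the shared layer during training. A real system would replace the placeholder vectors with learned encoder outputs and train the weights with a deep learning framework.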
Fatih Kaya
Yunus Emre Karaca
Serpil Aslan
The Visual Computer
Fırat University
Turgut Özal University
Kaya et al. studied this question.
www.synapsesocial.com/papers/69e07e582f7e8953b7cbf50c — DOI: https://doi.org/10.1007/s00371-026-04475-1