What question did this study set out to answer?

The research aims to enhance emotion analysis by integrating text and visual cues through a deep learning approach.

April 16, 2026 | Open Access

Advancing multimodal emotion analysis: a hybrid deep learning approach with intermediate fusion and multi-task learning

Key Points

Developed a hybrid deep learning model combining text and visual features.
Utilized RoBERTa and BiGRU for text processing.
Employed ViT and ResNet50 architectures for visual analysis.
Implemented an intermediate fusion mechanism and multi-task learning framework.
Conducted experiments on the MVSA dataset.
Achieved 96.02% accuracy, 95.51% precision, 94.07% recall, and 94.73% F1-score.
Outperformed several existing multimodal benchmarks.
Demonstrated the effectiveness of fused multimodal representations.
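
The data flow named in the key points (intermediate fusion followed by multi-task heads) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the feature vectors stand in for the RoBERTa/BiGRU and ViT/ResNet50 encoder outputs, and the dimensions, the two task heads (`head_a`, `head_b`), the class counts, and the equal loss weights are all hypothetical choices made for illustration.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_out, n_in):
    """Random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def cross_entropy(probs, label):
    return -math.log(probs[label])

class FusionMultiTaskSketch:
    """Concatenate intermediate features, share one layer, branch into two heads."""

    def __init__(self, text_dim, visual_dim, hidden_dim, n_classes_a, n_classes_b, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(rng, hidden_dim, text_dim + visual_dim)
        self.head_a = init_layer(rng, n_classes_a, hidden_dim)
        self.head_b = init_layer(rng, n_classes_b, hidden_dim)

    def forward(self, text_feat, visual_feat):
        # Intermediate fusion: join encoder features, not raw inputs or final predictions.
        fused = text_feat + visual_feat  # list concatenation
        hidden = relu(linear(fused, *self.shared))
        # Multi-task learning: both heads read the same shared representation.
        probs_a = softmax(linear(hidden, *self.head_a))
        probs_b = softmax(linear(hidden, *self.head_b))
        return probs_a, probs_b

def joint_loss(probs_a, probs_b, label_a, label_b, weight_a=0.5, weight_b=0.5):
    """Weighted sum of per-task cross-entropy losses (weights are assumptions)."""
    return weight_a * cross_entropy(probs_a, label_a) + weight_b * cross_entropy(probs_b, label_b)

# Stand-ins for encoder outputs; real features would come from the trained encoders.
rng = random.Random(42)
text_feat = [rng.gauss(0, 1) for _ in range(8)]
visual_feat = [rng.gauss(0, 1) for _ in range(6)]

model = FusionMultiTaskSketch(text_dim=8, visual_dim=6, hidden_dim=4, n_classes_a=3, n_classes_b=3)
probs_a, probs_b = model.forward(text_feat, visual_feat)
loss = joint_loss(probs_a, probs_b, label_a=0, label_b=2)
print(probs_a, probs_b, loss)
```

Because both heads share the fused hidden layer, gradients from either task would update the same representation during training, which is the usual motivation for pairing fusion with multi-task learning.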

Abstract

Emotion analysis is a critical research domain focused on detecting the emotional states of individuals or communities across multiple data modalities, including text, images, and audio. While substantial progress has been made in unimodal (text-based) sentiment analysis, real-world scenarios often involve multimodal data, making integrated approaches essential for capturing contextual richness and improving predictive accuracy. This study introduces a hybrid deep learning model that combines text and visual features through an intermediate fusion mechanism and multi-task learning framework. Textual inputs are processed using RoBERTa and BiGRU layers, while visual inputs are analyzed through ViT and ResNet50 architectures enhanced by the Convolutional Block Attention Module (CBAM). The fused multimodal representations enable simultaneous and more robust emotion classification. Experimental results on the MVSA dataset demonstrate the superior performance of the proposed model, achieving 96.02% accuracy, 95.51% precision, 94.07% recall, and 94.73% F1-score, outperforming several state-of-the-art multimodal benchmarks. These findings underscore the model’s methodological contributions and its strong potential for advancing the field of multimodal emotion analysis in both academic research and real-world applications.
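
For readers who want to see how the four reported metrics relate to one another, the sketch below computes accuracy and macro-averaged precision, recall, and F1 from a confusion matrix. The matrix `toy_confusion` is invented for illustration and is not data from the study, and macro averaging is an assumption, since the abstract does not state the averaging scheme. The last lines check one piece of arithmetic on the reported figures: the harmonic mean of 95.51% precision and 94.07% recall is about 94.78%, close to but not identical with the reported 94.73% F1-score, a gap that can arise when F1 is averaged per class.

```python
def macro_metrics(confusion):
    """Return (accuracy, macro precision, macro recall, macro F1).

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical 3-class confusion matrix (not from the paper).
toy_confusion = [
    [50, 2, 1],
    [3, 40, 2],
    [1, 1, 20],
]
accuracy, precision, recall, f1 = macro_metrics(toy_confusion)
print(accuracy, precision, recall, f1)

# Arithmetic check on the figures reported in the abstract.
reported_precision, reported_recall = 95.51, 94.07
harmonic_mean = (2 * reported_precision * reported_recall
                 / (reported_precision + reported_recall))
print(round(harmonic_mean, 2))
```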

Cite This Study

Kaya et al. studied this question.

synapsesocial.com/papers/69e07e582f7e8953b7cbf50c https://doi.org/10.1007/s00371-026-04475-1
