What question did this study set out to answer?

This research aims to enhance multimodal sentiment analysis by developing an effective information fusion method for diverse data types.

March 29, 2026 | Open Access

Multimodal emotion feature extraction and information fusion methods for video content

Key Points

Developed a Transformer-based model named MERCAT.
Implemented a cross-modal self-attention mechanism.
Conducted extensive experiments across multiple benchmarks for validation.
Achieved superior performance in emotion classification with higher accuracy and F1-score.
Significantly reduced error in emotion intensity prediction and improved correlation.

Abstract

Multimodal sentiment analysis remains challenging due to the difficulty of fusing heterogeneous data like facial expressions, speech, and pose. Unlike unimodal analysis, it is often hindered by problems of information redundancy, heterogeneity, and complex temporal dynamics. This paper proposes MERCAT, a Transformer-based model featuring a cross-modal self-attention mechanism to capture deep correlations between different modalities. This design enables highly efficient and context-aware inter-modal fusion. Extensive experiments on multiple benchmarks show that MERCAT achieves excellent performance. It notably excels in emotion classification, significantly improving accuracy and F1-score over strong baselines, and in emotion intensity prediction, where it substantially reduces error and improves correlation. The study conclusively verifies the efficacy of the cross-modal attention mechanism for information fusion, providing a robust and effective solution for advancing multimodal sentiment analysis.
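
The abstract does not give MERCAT's equations, so the following is only a minimal sketch of generic scaled dot-product cross-modal attention, not the authors' implementation. It assumes one modality (here called audio) supplies the queries and another (here called face) supplies the keys and values. The function names, the feature dimensions, and the random projection matrices are all hypothetical and chosen purely for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query_seq, context_seq, w_q, w_k, w_v):
    """Let each step of query_seq attend over every step of context_seq.

    query_seq:   (T_q, d_query_modality) features of the attending modality
    context_seq: (T_c, d_context_modality) features of the attended modality
    Returns the fused sequence (T_q, d_model) and the attention weights (T_q, T_c).
    """
    q = matmul(query_seq, w_q)    # project queries into the shared space
    k = matmul(context_seq, w_k)  # project keys into the shared space
    v = matmul(context_seq, w_v)  # project values into the shared space
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters or extracted features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
audio = random_matrix(5, 6, rng)  # assumed: 5 audio frames, 6 features each
face = random_matrix(7, 8, rng)   # assumed: 7 video frames, 8 features each
w_q = random_matrix(6, 4, rng)    # audio -> shared 4-dimensional space
w_k = random_matrix(8, 4, rng)    # face  -> shared 4-dimensional space
w_v = random_matrix(8, 4, rng)

fused, weights = cross_modal_attention(audio, face, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the face frames for one audio frame, so `fused` has one face-informed vector per audio frame. A full model would typically use learned projections, multiple heads, and attention in both directions between every pair of modalities; whether MERCAT does so is not stated in the abstract.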
