May 9, 2026 · Open Access

Sequential translation-based multimodal sentiment analysis under uncertain missing modalities

Yan Hai (North China University of Water Resources and Electric Power); Shanqi Lu (North China University of Water Resources and Electric Power); Zhizhong Liu (Yantai University)

Key Points

Addressed the challenge of uncertain missing modalities in multimodal sentiment analysis (MSA).
Introduced a sequential translation-based MSA model (STMSA).
Implemented a text-centric bidirectional translation mechanism for multimodal interactions.
Used a low-complexity encoder-decoder architecture to fit joint representations.
Outperformed 10 state-of-the-art baseline models on the CMU-MOSI dataset.
Demonstrated improved sentiment classification accuracy on both the CMU-MOSI and IEMOCAP datasets.
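
To make "uncertain missing modalities" concrete, here is a minimal Python sketch in which each modality of a sample may be absent, and which ones are absent is not known in advance. The function name `drop_modalities`, the `missing_rate` parameter, the use of `None` for a missing modality, and the rule that at least one modality is always kept are all assumptions for illustration; the source does not describe the study's actual missing-modality protocol.

```python
import random

MODALITIES = ("text", "audio", "video")


def drop_modalities(sample, missing_rate, rng):
    """Return a copy of `sample` with each modality independently dropped.

    Hypothetical helper: a dropped modality is replaced by None, and at
    least one modality is always kept so the sample stays usable.
    """
    # Keep a modality when its random draw is at or above the missing rate.
    kept = [m for m in MODALITIES if rng.random() >= missing_rate]
    if not kept:
        # Every modality was dropped: restore one at random.
        kept = [rng.choice(MODALITIES)]
    return {m: (sample[m] if m in kept else None) for m in MODALITIES}


rng = random.Random(0)
sample = {"text": [0.1, 0.2], "audio": [0.3], "video": [0.4, 0.5, 0.6]}
masked = drop_modalities(sample, missing_rate=0.5, rng=rng)
available = [m for m, features in masked.items() if features is not None]
```

A model evaluated under this kind of setting only sees `masked`, so it cannot assume a fixed pattern of available modalities.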

Abstract

Multimodal Sentiment Analysis (MSA) aims to fuse information from multiple modalities to achieve precise sentiment classification. Recently, uncertain missing modalities have emerged as a new challenge in MSA. Previous studies have addressed this issue by modeling interactions between pairs of modalities to compensate for the missing information. However, representations built this way struggle to reconstruct true cross-modal semantics accurately because they lack guidance from a third modality. Additionally, existing approaches underuse the text modality, and their models are relatively complex. To tackle these issues, we propose a sequential translation-based MSA model (STMSA). This model incorporates two key designs. First, the text-centric bidirectional translation mechanism leverages the dominant role of the text modality in affective tasks to sequentially establish bidirectional mappings with the audio and video modalities. Through semantic guidance from text, this mechanism explores the deep connections among the three modalities and yields cross-modal representations that align more closely with real affective distributions. Second, the low-complexity non-modal completion architecture performs distributed fitting on joint representations in a shared space using only an encoder-decoder, thereby avoiding complex missing-modality generation processes. Extensive experiments on two public datasets, CMU-MOSI and IEMOCAP, show that the proposed model outperforms 10 state-of-the-art baseline models.
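
The text-centric, sequential idea in the abstract can be illustrated with a small NumPy sketch: text is translated to audio and back, then to video and back, and the intermediate shared-space representations are concatenated into a joint representation. This is only a forward pass under stated assumptions, not the STMSA implementation. The one-layer linear encoder-decoder, the `tanh` activation, the mean-squared-error losses, the feature sizes, and every function name (`init_translator`, `translate`, `text_centric_pass`) are hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def init_translator(rng, d_src, d_shared, d_tgt):
    """Hypothetical one-layer encoder-decoder from a source to a target modality."""
    return {
        "enc": rng.normal(scale=0.1, size=(d_src, d_shared)),
        "dec": rng.normal(scale=0.1, size=(d_shared, d_tgt)),
    }


def translate(params, x):
    """Encode x into the shared space, then decode into the target modality."""
    shared = np.tanh(x @ params["enc"])
    return shared, shared @ params["dec"]


def mse(a, b):
    return float(np.mean((a - b) ** 2))


def text_centric_pass(text, other, forward, backward):
    """Run text -> other -> text and return the shared representation and loss."""
    shared, other_hat = translate(forward, text)   # text -> audio/video
    _, text_back = translate(backward, other_hat)  # audio/video -> text
    loss = mse(other_hat, other) + mse(text_back, text)
    return shared, loss


rng = np.random.default_rng(0)
batch, d_text, d_audio, d_video, d_shared = 4, 8, 5, 6, 3
text = rng.normal(size=(batch, d_text))
audio = rng.normal(size=(batch, d_audio))
video = rng.normal(size=(batch, d_video))

# Stage 1: bidirectional translation between text and audio.
shared_ta, loss_ta = text_centric_pass(
    text, audio,
    init_translator(rng, d_text, d_shared, d_audio),
    init_translator(rng, d_audio, d_shared, d_text),
)
# Stage 2: bidirectional translation between text and video.
shared_tv, loss_tv = text_centric_pass(
    text, video,
    init_translator(rng, d_text, d_shared, d_video),
    init_translator(rng, d_video, d_shared, d_text),
)

# Joint representation (assumed here to be a simple concatenation).
joint = np.concatenate([shared_ta, shared_tv], axis=1)
total_loss = loss_ta + loss_tv
```

In this sketch text is the only input to both stages, which is one way to read "text-centric". A trained model would minimize something like `total_loss` together with a sentiment classification loss on `joint`, but that training setup is likewise an assumption here.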
