Multimodal Sentiment Analysis (MSA) is an emerging research field that aims to identify the sentiment of speakers through Audio (A), Video (V), and Text (T) modalities. The major challenge is to capture joint representations that associate and integrate information from the different modalities. Most existing methods obtain joint representations by concatenating input features. However, these methods fall short of fully exploiting cross-modal interactions to ensure consistency and complementarity among modalities. To solve this problem, we design a novel multimodal sentiment analysis framework named Cross-Modal Joint Representation Transformer (CMJRT), which exploits hierarchical interactions among modalities by passing joint representations from bimodality to unimodality. Specifically, we adopt cyclic translation to obtain bimodal joint representations, in which encoder-decoders translate one modality to the other forward and backward. The translation process ensures consistency between modalities. In addition, to explore complementarity among modalities, a cross-modal transformer is used to reinforce each unimodality with common information from bimodality. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our proposed method outperforms existing approaches.
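The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify layer types, dimensions, or losses, so everything here is an assumption made for illustration. The sketch uses single linear layers with a tanh bottleneck as the encoder-decoders, a mean-squared-error reconstruction loss, word-aligned source and target sequences of equal length, and one single-head attention layer with a residual connection in place of a full cross-modal transformer. The function names `cyclic_translation` and `cross_modal_attention` and all weight names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cyclic_translation(src, tgt, enc_f, dec_f, enc_b, dec_b):
    """Translate src -> tgt (forward) and back to src (backward).

    Assumption: src and tgt are aligned sequences of equal length T.
    The forward bottleneck is returned as the bimodal joint representation.
    """
    joint = np.tanh(src @ enc_f)       # (T, h) forward encoder output
    tgt_hat = joint @ dec_f            # (T, d_tgt) forward translation
    back = np.tanh(tgt_hat @ enc_b)    # (T, h) backward encoder output
    src_hat = back @ dec_b             # (T, d_src) backward translation
    # Assumed loss: MSE on both translation directions.
    loss = np.mean((tgt_hat - tgt) ** 2) + np.mean((src_hat - src) ** 2)
    return joint, float(loss)


def cross_modal_attention(unimodal, joint, w_q, w_k, w_v):
    """Reinforce a unimodal sequence with a bimodal joint representation.

    Queries come from the unimodal sequence; keys and values come from the
    joint representation. w_q and w_v must project to the unimodal feature
    size so that the residual addition is well defined.
    """
    q = unimodal @ w_q                        # (T_u, d)
    k = joint @ w_k                           # (T_j, d)
    v = joint @ w_v                           # (T_j, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_u, T_j) scaled dot product
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return unimodal + weights @ v, weights


# Toy example with random features (illustrative sizes only).
rng = np.random.default_rng(0)
T, d_text, d_audio, d_video, h = 5, 8, 4, 3, 6
text = rng.normal(size=(T, d_text))
audio = rng.normal(size=(T, d_audio))
video = rng.normal(size=(T, d_video))

# Bimodal (text, audio) joint representation via cyclic translation.
joint_ta, loss_ta = cyclic_translation(
    text, audio,
    rng.normal(size=(d_text, h)), rng.normal(size=(h, d_audio)),
    rng.normal(size=(d_audio, h)), rng.normal(size=(h, d_text)),
)

# Video is reinforced with information from the (text, audio) pair.
video_reinforced, attn = cross_modal_attention(
    video, joint_ta,
    rng.normal(size=(d_video, d_video)),
    rng.normal(size=(h, d_video)),
    rng.normal(size=(h, d_video)),
)
```

In this sketch the joint representation of one modality pair reinforces the remaining modality, which mirrors the "bimodality to unimodality" direction described above. Which pairs are translated and how the reinforced features are fused for prediction are not stated in the abstract and are left out here.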
Meng Xu
Civil Aviation University of China
Feifei Liang
First Automotive Works (China)
Xiangyi Su
Civil Aviation University of China
IEEE Access
Xu et al. studied this question.
synapsesocial.com/papers/6a20d747920f77b2c049cce5 — DOI: https://doi.org/10.1109/access.2022.3219200