May 7, 2026

Open Access

Enhancing emotion recognition through three modalities fusion and contrastive learning with the Mamba architecture

Key Points

Developed an efficient multimodal fusion and contrastive learning method for emotion recognition.
Leveraged the Mamba architecture for distinct feature extraction from multimodal data.
Introduced the Fusion Mamba block for dual input handling and comprehensive information fusion.
Optimized contrastive learning with auxiliary classification for enhanced performance.
Achieved state-of-the-art performance with 32.2% faster inference.
Validated effectiveness through experiments on three public datasets.
Provided both quantitative and qualitative evaluations.

Abstract

Multimodal emotion recognition extracts emotional information from sequential multimodal data and classifies emotional tendencies. Current multimodal fusion methods rely mainly on Transformers to extract features and integrate different data types. Although Transformers are strong at learning global information, their self-attention has quadratic complexity in sequence length. Recent advances in state space models, especially the Mamba architecture, offer a promising alternative that achieves global awareness with linear complexity. However, the potential of Mamba for information fusion in multimodal domains remains largely untapped. This paper introduces an efficient multimodal fusion and contrastive learning method for emotion recognition, called Fusion Mamba and Contrastive Learning. To extract distinct features, a unimodal Mamba architecture is used to enhance the unimodal representations. For comprehensive information fusion, the Mamba block is extended to handle dual inputs, forming a new module called the Fusion Mamba block. This block is the basis for an architecture that incorporates three modalities and three branches. In addition, contrastive learning and interaction-level auxiliary classification constraints are jointly optimized to boost performance. The effectiveness of our approach is validated through experiments on three public datasets. Both quantitative and qualitative evaluations show that our method achieves state-of-the-art performance with 32.2% faster inference. Extensive ablation studies further confirm the effectiveness of the Mamba architecture in multimodal tasks.
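
The linear-complexity claim for state space models can be illustrated with a toy recurrence. The sketch below is not the paper's Fusion Mamba block or the actual Mamba implementation. It is a scalar, single-channel state space scan with an input-dependent step size, loosely in the spirit of selective state space models. The function name `selective_scan` and the parameters `decay`, `w_delta`, `b`, and `c` are hypothetical and chosen only for illustration.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size strictly positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, decay=1.0, w_delta=1.0, b=1.0, c=1.0):
    """Toy scalar state space scan (illustrative assumption, not the paper's model).

    For each input x_t:
        delta_t = softplus(w_delta * x_t)   # input-dependent step size
        a_t     = exp(-delta_t * decay)     # state retention factor in (0, 1)
        h_t     = a_t * h_{t-1} + delta_t * b * x_t
        y_t     = c * h_t
    Each step reads only the previous state, so a sequence of length L
    costs O(L) steps, unlike the O(L^2) pairwise comparisons of self-attention.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)
        a_bar = math.exp(-delta * decay)
        b_bar = delta * b  # simplified discretization of the input weight
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


# An impulse followed by zeros: the state carries the first input forward
# and decays at every later step.
outputs = selective_scan([1.0, 0.0, 0.0])
```

Because the hidden state summarizes everything seen so far, later outputs still depend on early inputs, which is the sense in which such a recurrence can be globally aware while touching each element once.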

Authors

Qianjun Shuai

Xiaohao Chen

Feng Hu

Journals

Complex & Intelligent Systems

Institutions

University of Sunderland

Communication University of China
