Multimodal learning leverages data from multiple sensory modalities or interaction channels to enhance the learning process. Integrating diverse modalities improves a model's ability to perceive and understand complex information and enables effective cross-modal interaction and fusion. In this paper, we propose a multimodal emotion recognition model built from scratch. We investigate four fusion strategies for integrating emotional information from the text, speech, and visual modalities. Through comprehensive evaluation, we demonstrate that the fusion strategy based on a multi-head cross-attention mechanism outperforms the other three.
Liuwenjie et al. studied this question.
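To make the idea of multi-head cross-attention fusion concrete, the following is a minimal NumPy sketch, not the model proposed in the paper. It assumes that each modality has already been encoded into a sequence of feature vectors of a common width, that text serves as the query modality, and that the attended features are concatenated and mean-pooled. The function names (`cross_attention`, `init_params`), the dimensions, and the randomly initialised projection matrices are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d_model, rng):
    # Hypothetical random projections; a trained model would learn these.
    names = ("Wq", "Wk", "Wv", "Wo")
    return {n: rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for n in names}


def cross_attention(query, context, params, num_heads):
    """Attend from `query` (Tq, d) over `context` (Tk, d).

    Returns the attended features (Tq, d) and the attention
    weights (num_heads, Tq, Tk).
    """
    t_q, d_model = query.shape
    t_k = context.shape[0]
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    # Project, then split the feature axis into heads: (heads, T, d_head).
    q = (query @ params["Wq"]).reshape(t_q, num_heads, d_head).transpose(1, 0, 2)
    k = (context @ params["Wk"]).reshape(t_k, num_heads, d_head).transpose(1, 0, 2)
    v = (context @ params["Wv"]).reshape(t_k, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, Tq, Tk)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (heads, Tq, d_head)

    # Merge the heads back into one feature axis and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(t_q, d_model)
    return merged @ params["Wo"], weights


rng = np.random.default_rng(0)
d_model, num_heads = 16, 4

# Stand-ins for encoder outputs; sequence lengths differ across modalities.
text = rng.standard_normal((5, d_model))
speech = rng.standard_normal((12, d_model))
visual = rng.standard_normal((8, d_model))

# Text queries attend to speech and to visual features separately.
text_from_speech, w_speech = cross_attention(text, speech, init_params(d_model, rng), num_heads)
text_from_visual, w_visual = cross_attention(text, visual, init_params(d_model, rng), num_heads)

# Assumed fusion step: concatenate along features, then mean-pool over time.
fused = np.concatenate([text, text_from_speech, text_from_visual], axis=-1).mean(axis=0)
print(fused.shape)
```

In this sketch the fused vector has width `3 * d_model` and would be passed to an emotion classifier. Because the attention weights are computed between a query sequence and a context sequence of different lengths, the modalities do not need to be aligned frame by frame, which is one reason cross-attention is a natural candidate for this kind of fusion.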