November 18, 2025 | Open Access

A multimodal fusion model for real-time environment emotion recognition using audio-visual-textual features

Abstract

Multimodal learning combines several modalities to draw richer conclusions or make more precise predictions, and it is increasingly used to identify human emotions. This study proposes M-fusHER (Multimodal fusion Human Emotion Recognition), a novel three-stage multimodal model for real-time human emotion recognition from text, audio, and video. In the first stage, features are extracted by a convolutional neural network combined with a multiplicative LSTM; the multiplicative LSTM is also used to extract and learn the text features. In the second stage, the modalities are fused in binary (pairwise) form: video with audio, and text with audio. In the third stage, real-time object detection for human emotion recognition is run on real videos, using a fine-tuned YOLOv6 model to detect facial features and expressions. The experimental results are obtained by fusing audio, text, and video using standard features. The model is evaluated on three datasets, IEMOCAP, MOSEI, and MELD, where M-fusHER reaches detection accuracies of approximately 95.45%, 88.76%, and 95.41%, respectively, which is encouraging.
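For readers unfamiliar with the multiplicative LSTM named in the abstract, the following is a minimal sketch of one recurrence step in the commonly used formulation, where an intermediate state m_t = (W_mx x_t) * (W_mh h_{t-1}) replaces h_{t-1} in the gate computations. It is not the authors' implementation: the abstract does not give the layer sizes, the initialisation, or how the CNN is coupled to the recurrent layer, so the parameter names, the omission of bias terms, and the helper functions `init_params` and `mlstm_step` are all assumptions made for illustration.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(input_size, hidden_size, seed=0):
    """Small random weights; names and scale are illustrative assumptions."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {
        "W_mx": mat(hidden_size, input_size),
        "W_mh": mat(hidden_size, hidden_size),
    }
    for gate in ("i", "f", "o", "h"):
        params["W_" + gate + "x"] = mat(hidden_size, input_size)
        params["W_" + gate + "m"] = mat(hidden_size, hidden_size)
    return params


def mlstm_step(params, x, h_prev, c_prev):
    """One multiplicative LSTM step (bias terms omitted for brevity)."""
    # Intermediate state: elementwise product of input and hidden projections.
    m = [a * b for a, b in zip(matvec(params["W_mx"], x),
                               matvec(params["W_mh"], h_prev))]

    def pre(gate):
        # Pre-activation W_{gate,x} x + W_{gate,m} m
        return [a + b for a, b in zip(matvec(params["W_" + gate + "x"], x),
                                      matvec(params["W_" + gate + "m"], m))]

    i = [sigmoid(z) for z in pre("i")]      # input gate
    f = [sigmoid(z) for z in pre("f")]      # forget gate
    o = [sigmoid(z) for z in pre("o")]      # output gate
    cand = [math.tanh(z) for z in pre("h")]  # candidate update

    c = [fv * cp + iv * g for fv, cp, iv, g in zip(f, c_prev, i, cand)]
    h = [math.tanh(cv) * ov for cv, ov in zip(c, o)]
    return h, c


# Usage: run a short sequence of 3-dimensional inputs through a 4-unit cell.
params = init_params(input_size=3, hidden_size=4)
h, c = [0.0] * 4, [0.0] * 4
for x in ([0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [1.0, 1.0, 1.0]):
    h, c = mlstm_step(params, x, h, c)
```

In a feature extractor of the kind the abstract describes, the final hidden state `h` would serve as the sequence-level feature vector passed on to the fusion stage; that wiring is likewise an assumption here.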
