Multimodal learning combines multiple modalities to draw richer conclusions or make more precise predictions, and it is now widely used to recognize human emotions. This study proposes M-fusHER (Multimodal fusion Human Emotion Recognition), a novel three-stage multimodal model for real-time human emotion recognition from text, audio, and video. In the first stage, features are extracted with a convolutional neural network combined with a multiplicative LSTM; the multiplicative LSTM is also used to extract and learn from the text features. In the second stage, the modalities are fused in binary (pairwise) form: video with audio, and text with audio. In the third stage, real-time object detection for human emotion recognition is run on real videos, using a fine-tuned YOLOv6 model to detect facial features and expressions. The experimental results are obtained by fusing audio, text, and video using standard features. Three datasets, IEMOCAP, MOSEI, and MELD, are used for implementation, on which M-fusHER achieves detection accuracies of approximately 95.45%, 88.76%, and 95.41%, respectively, which is encouraging.
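The abstract names a multiplicative LSTM without giving its equations. As a rough illustration only, the sketch below implements one step of the standard multiplicative LSTM formulation, in which an intermediate state m_t = (W_mx x_t) ⊙ (W_mh h_{t-1}) replaces h_{t-1} in the gate computations. It is not the authors' implementation: the helper names (`init_params`, `mlstm_step`, `encode`), the random initialization, the omission of bias terms, and the layer sizes are all assumptions made for illustration, and the paper's actual CNN front end and fusion layers are not shown.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(input_size, hidden_size, seed=0):
    """Hypothetical initializer: small random weights, no biases."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {
        "W_mx": mat(hidden_size, input_size),
        "W_mh": mat(hidden_size, hidden_size),
    }
    # One input matrix and one intermediate-state matrix per gate
    # (i, f, o) and for the candidate update (h).
    for gate in ("i", "f", "o", "h"):
        params["W_" + gate + "x"] = mat(hidden_size, input_size)
        params["W_" + gate + "m"] = mat(hidden_size, hidden_size)
    return params


def mlstm_step(params, x, h_prev, c_prev):
    """One multiplicative LSTM step; returns the new (h, c)."""
    # Multiplicative intermediate state: m = (W_mx x) * (W_mh h_prev).
    m = [a * b for a, b in zip(matvec(params["W_mx"], x),
                               matvec(params["W_mh"], h_prev))]

    def pre(gate):
        # Pre-activation uses m where a plain LSTM would use h_prev.
        return [a + b for a, b in zip(matvec(params["W_" + gate + "x"], x),
                                      matvec(params["W_" + gate + "m"], m))]

    i = [sigmoid(z) for z in pre("i")]      # input gate
    f = [sigmoid(z) for z in pre("f")]      # forget gate
    o = [sigmoid(z) for z in pre("o")]      # output gate
    g = [math.tanh(z) for z in pre("h")]    # candidate update

    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [math.tanh(ck) * ok for ck, ok in zip(c, o)]
    return h, c


def encode(params, sequence, hidden_size):
    """Run the cell over a sequence of feature vectors; return the last h."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        h, c = mlstm_step(params, x, h, c)
    return h


if __name__ == "__main__":
    params = init_params(input_size=3, hidden_size=4)
    toy_sequence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5]]  # made-up features
    print(encode(params, toy_sequence, hidden_size=4))
```

In a pipeline like the one described, the final hidden vector from `encode` would stand in for a per-modality feature vector that a later fusion stage consumes. A real implementation would normally use a deep learning framework with trained weights rather than pure Python.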
Gupta et al. studied this question.