Speech is a crucial tool for communicating and expressing emotions, and analyzing emotions from speech signals has been a focus of signal processing research for decades. However, designing emotion recognition models is challenging because they tend to rely on speaker-specific characteristics such as language, accent, culture, age, and gender. It is therefore advantageous to build speaker-independent models. Here, we propose a speaker-independent emotion recognition model that uses novel multi-level audio features and a co-attention module. The model takes Complex MFCCs, spectrograms, and the original speech signal as inputs to three networks: a Bi-LSTM, a Swin Transformer, and Wav2vec2.0, respectively. The representations produced by these networks are combined, together with a proposed feature embedding optimization mechanism for Wav2vec2.0. The fused features are then classified into emotions by an SVM classifier with a non-linear kernel. Experiments on the IEMOCAP dataset demonstrate promising results, with an improvement of up to 2.28% in emotion recognition accuracy over prior work.
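To make the fusion step concrete, the following is a minimal NumPy sketch of one common form of co-attention between feature streams. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the co-attention formulation, so scaled dot-product attention, mean pooling, and the function names `co_attend` and `fuse` are all hypothetical, and random arrays stand in for the Bi-LSTM, Swin Transformer, and Wav2vec2.0 outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def co_attend(a, b):
    """Co-attention between two sequences (assumed scaled dot-product form).

    a: (Ta, d) and b: (Tb, d) frame-level features with a shared width d.
    Returns a_ctx (Ta, d), where each frame of `a` summarizes `b`,
    and b_ctx (Tb, d), where each frame of `b` summarizes `a`.
    """
    d = a.shape[1]
    affinity = a @ b.T / np.sqrt(d)            # (Ta, Tb) pairwise similarity
    a_ctx = softmax(affinity, axis=1) @ b      # attend over frames of b
    b_ctx = softmax(affinity.T, axis=1) @ a    # attend over frames of a
    return a_ctx, b_ctx


def fuse(mfcc_seq, spec_seq, wav_seq):
    """Hypothetical fusion: pair the waveform stream with each other stream,
    mean-pool every attended sequence over time, and concatenate."""
    wav_from_mfcc, mfcc_from_wav = co_attend(wav_seq, mfcc_seq)
    wav_from_spec, spec_from_wav = co_attend(wav_seq, spec_seq)
    pooled = [seq.mean(axis=0) for seq in
              (wav_from_mfcc, mfcc_from_wav, wav_from_spec, spec_from_wav)]
    return np.concatenate(pooled)              # utterance-level vector, (4 * d,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8  # assumed shared embedding width after projection
    mfcc = rng.normal(size=(5, d))    # stand-in for Bi-LSTM outputs
    spec = rng.normal(size=(7, d))    # stand-in for Swin Transformer outputs
    wav = rng.normal(size=(11, d))    # stand-in for Wav2vec2.0 outputs
    print(fuse(mfcc, spec, wav).shape)
```

In a pipeline like the one described, a fused utterance-level vector of this kind would then be passed to a non-linear kernel SVM, for example scikit-learn's `SVC(kernel="rbf")`; the abstract does not state which kernel is used.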
Saadati et al. studied this question.