What question did this study set out to answer?

This research aims to explore the progression of deep learning methods in speech emotion recognition (SER), focusing on both basic and complex emotions.

June 14, 2026 | Open Access

Speech emotion recognition using deep learning: from basic to complex emotions in unimodal and multimodal frameworks

Key Points

This research aims to explore the progression of deep learning methods in speech emotion recognition (SER), focusing on both basic and complex emotions.
Conducted a systematic review following PRISMA guidelines of SER approaches from 2019 to present.
Analyzed models, datasets, and architectures including CNNs, RNNs, and Transformers.
Assessed the efficacy of unimodal versus multimodal frameworks in recognizing emotions.
Deep learning approaches significantly improved SER performance, especially hybrid models.
Unimodal models struggled with complex emotions due to overlapping acoustic features.
Multimodal models consistently outperformed unimodal models by utilizing complementary data.

Abstract

Speech Emotion Recognition (SER) is an advanced technology for developing intuitive and empathetic human-computer interfaces (HCI). While traditional SER systems have achieved a certain degree of success in recognising basic emotions from acted speech in a closed environment, real-world applications necessitate the recognition of more complex emotions. This paper presents a systematic review of deep learning approaches in SER from 2019 to the present, following the PRISMA guidelines, with a specific focus on the bridge between basic and complex SER within unimodal (audio-only) and multimodal frameworks. The review analyses the landscape of emotion models, datasets, and state-of-the-art (SOTA) model architectures, including CNNs, RNNs, Transformers, and their hybrids. The results reveal that deep learning has improved performance, with hybrid models improving considerably; however, unimodal models still struggle with the subtle and often overlapping acoustic features of complex emotions. In contrast, multimodal models that leverage complementary information are consistently superior. Nevertheless, challenges remain, such as the over-reliance on a limited range of non-naturalistic datasets, the subjectivity associated with labelling complex emotions, and the failure of models to generalise to real-world variability. Finally, the paper concludes with a strategic roadmap to guide further research on recognising complex emotions, including the efficient creation of large, naturalistic datasets for future modelling, the development of more advanced techniques for multimodal fusion, and the use of available but so far overlooked acoustic features to better model the complexity of human emotions.

Cite This Study

Lai et al. studied this question.

synapsesocial.com/papers/6a2e4855b1cc60ccdea8c8b1 https://doi.org/10.1007/s00521-026-12186-w
