Abstract Speech Emotion Recognition (SER) is an advanced technology for developing intuitive and empathetic human-computer interfaces (HCI). While traditional SER systems have achieved a certain degree of success in recognising basic emotions from acted speech in closed environments, real-world applications necessitate the recognition of more complex emotions. This paper presents a systematic review of deep learning approaches in SER from 2019 to the present, following the PRISMA guidelines, with a specific focus on the bridge between basic and complex SER within unimodal (audio-only) and multimodal frameworks. The review analyses the landscape of emotion models, datasets, and state-of-the-art (SOTA) model architectures, including CNNs, RNNs, Transformers, and their hybrids. The results reveal that deep learning has improved performance, with hybrid models improving considerably; however, unimodal models still struggle with the subtle and often overlapping acoustic features of complex emotions. In contrast, multimodal models that leverage complementary information are consistently superior. Nevertheless, challenges remain, such as the over-reliance on a limited range of non-naturalistic datasets, the subjectivity associated with labelling complex emotions, and the failure of models to generalise to real-world variability. Finally, the paper concludes with a strategic roadmap to guide further research on recognising complex emotions, including the efficient creation of large, naturalistic datasets for future modelling, the development of more advanced techniques for multimodal fusion, and the targeting of underexplored but available acoustic features to better model the complexity of human emotions.
Lai et al. (Mon,) studied this question.