This work proposes a deep learning method for speech emotion recognition that combines Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to improve the accuracy of emotion classification. The model captures both spatial and temporal dependencies in speech signals, using spectrogram-based features such as Mel-frequency cepstral coefficients (MFCCs). To enhance robustness, CAEmoCyGAN is used for data augmentation. The model is trained and validated on the CREMA-D dataset, attaining 95.75% accuracy across the anger, fear, happiness, sadness, and disgust emotions. The complementary strengths of CNNs and LSTMs improve emotion detection, with the proposed method surpassing established traditional machine learning (ML) approaches and yielding more noise-robust implementations. The approach has broad scope in human-computer interaction (HCI), mental well-being assessment, and customer experience improvement, where precise emotion identification strongly affects automated response and support platforms.
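The CNN-LSTM combination described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' architecture: the class name, layer sizes, number of MFCC coefficients (40), and hidden size are all assumptions chosen for illustration, and the CAEmoCyGAN augmentation step is not shown. The sketch only shows the general pattern of convolutional layers extracting local spectral patterns from an MFCC matrix, followed by an LSTM that models how those patterns evolve over time.

```python
import torch
import torch.nn as nn


class CnnLstmEmotionClassifier(nn.Module):
    """Hypothetical CNN-LSTM classifier over MFCC inputs (illustrative only)."""

    def __init__(self, n_mfcc=40, n_classes=5, hidden_size=64):
        super().__init__()
        # Convolutions capture local patterns across coefficients and time.
        # Pooling is applied along the coefficient axis only, so the time
        # resolution is preserved for the LSTM.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),
        )
        pooled_coeffs = n_mfcc // 4  # two poolings, each halving the axis
        self.lstm = nn.LSTM(
            input_size=32 * pooled_coeffs,
            hidden_size=hidden_size,
            batch_first=True,
        )
        # Five outputs: anger, fear, happiness, sadness, disgust.
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, mfcc):
        # mfcc: (batch, n_mfcc, time)
        x = mfcc.unsqueeze(1)                # (batch, 1, n_mfcc, time)
        x = self.cnn(x)                      # (batch, 32, n_mfcc // 4, time)
        b, c, f, t = x.shape
        # Flatten channels and coefficients into one feature vector per frame.
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        _, (h_n, _) = self.lstm(x)           # h_n: (1, batch, hidden_size)
        return self.classifier(h_n[-1])      # (batch, n_classes) logits


if __name__ == "__main__":
    model = CnnLstmEmotionClassifier()
    batch = torch.randn(8, 40, 120)  # 8 random stand-in clips, 120 frames each
    print(model(batch).shape)
```

In practice the random tensor would be replaced by MFCCs computed from CREMA-D audio clips, and the logits would be trained with a cross-entropy loss against the emotion labels.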
Bharshankar et al. (Wed,) studied this question.