Speech Emotion Recognition (SER) has recently attracted considerable attention because of its applications in human-computer interfaces, human-robot interaction, assessment of human behavior, virtual reality, and, most notably, highly AI-driven environments. Extracting emotion from human physiological information is not easy. Researchers have previously studied SER for different purposes, using a variety of audio features and approaches such as SVM, MLP, CNN, and LSTM. Most of these approaches use audio databases such as RAVDESS, IEMOCAP, and EMO-DB. In this paper, we focus on improving the accuracy of emotion detection from speech. For this purpose, we used three datasets, RAVDESS, TESS, and CREMA-D, and merged them to form one large dataset. We propose a hybrid deep learning model that combines a CNN with a BiLSTM network to recognize eight emotions: happy, calm, sad, surprised, neutral, angry, disgust, and fear. The model is trained, validated, and tested using three features extracted from the raw datasets: Zero Crossing Rate (ZCR), Root Mean Square Energy (RMSE), and Mel Frequency Cepstral Coefficients (MFCC). After extensive experiments, the proposed CNN-BiLSTM model achieved its highest accuracy of 97.8% on the three merged datasets.
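To make two of the three features concrete, the following is a minimal sketch of frame-wise ZCR and RMS energy in plain Python. It is an illustration and not the authors' pipeline: the function names, the frame length of 2048 samples, the hop of 512 samples, and the synthetic 440 Hz test tone are all assumptions, since the abstract does not state any extraction parameters. MFCCs are left out because they require a mel filterbank and a discrete cosine transform; in practice a library routine such as `librosa.feature.mfcc` would typically be used for that step.

```python
import math


def frame_signal(signal, frame_length=2048, hop_length=512):
    """Split a sequence of samples into overlapping frames (no padding).

    frame_length and hop_length are illustrative defaults, not values
    taken from the paper.
    """
    if len(signal) < frame_length:
        return []
    last_start = len(signal) - frame_length
    return [signal[start:start + frame_length]
            for start in range(0, last_start + 1, hop_length)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def rms_energy(frame):
    """Root mean square of the samples in one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def extract_features(signal, frame_length=2048, hop_length=512):
    """Return one (zcr, rms) pair per frame."""
    frames = frame_signal(signal, frame_length, hop_length)
    return [(zero_crossing_rate(f), rms_energy(f)) for f in frames]


# Hypothetical input: one second of a 440 Hz sine sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate)]
features = extract_features(tone)
```

For a pure tone, the ZCR of each frame should sit near twice the tone frequency divided by the sample rate, and the RMS near the amplitude divided by the square root of two. Real speech gives frame-to-frame variation in both, which is what a sequence model such as a BiLSTM can exploit.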
Islam et al. studied this question.