Understanding human emotions through vocal cues is central to developing emotionally intelligent systems, particularly in fields such as human-computer interaction, healthcare, and virtual assistants. However, accurately recognizing emotions from speech remains challenging because of variability in speaker traits and acoustic conditions and because emotional states are subtle and often overlap. This study presents a comparative analysis of transfer learning methods for speech emotion recognition (SER) based on pretrained audio neural networks. Specifically, the YAMNet and VGGish models were used both as static feature extractors and in a fine-tuning setup. The extracted embeddings were classified using traditional machine learning algorithms: Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forests (RF), and Logistic Regression (LR). Experiments were conducted on two widely used emotional speech datasets, RAVDESS and EmoDB. The results show that VGGish consistently outperforms YAMNet in both the feature extraction and fine-tuning scenarios. The highest classification accuracy (73.83%) was achieved using VGGish features with LR on EmoDB, and fine-tuning VGGish on EmoDB yielded a competitive accuracy of 72.90%. In addition, class-specific analysis showed that the highest AUC score, 0.9635, was obtained with LR in the VGGish + EmoDB setting, while fine-tuning both YAMNet and VGGish on the EmoDB dataset reached a recall of up to 1 for the ‘Sadness’ emotion.
Yunus Korkmaz (Wed,) studied this question.
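
The class-specific metrics mentioned in the abstract (per-class recall and AUC) can be illustrated with a minimal Python sketch. This is not the study's evaluation code: the function names `per_class_recall` and `one_vs_rest_auc`, the toy labels, and the scores are all hypothetical, and the AUC here is assumed to be computed one-vs-rest using the rank (Mann-Whitney) formulation.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted samples of a class
    divided by all samples that truly belong to that class."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {label: correct[label] / total[label] for label in total}


def one_vs_rest_auc(y_true, scores, positive):
    """AUC for one class against all others (assumed one-vs-rest):
    the fraction of (positive, negative) pairs in which the positive
    sample receives the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data for illustration only; not taken from RAVDESS or EmoDB.
y_true = ["sadness", "sadness", "anger", "anger", "happiness", "happiness"]
y_pred = ["sadness", "sadness", "anger", "happiness", "happiness", "anger"]
# Hypothetical classifier scores for the "sadness" class.
sadness_scores = [0.9, 0.7, 0.2, 0.8, 0.1, 0.3]

recall = per_class_recall(y_true, y_pred)
auc = one_vs_rest_auc(y_true, sadness_scores, "sadness")
print(recall)
print(auc)
```

In this toy example every true 'sadness' sample is predicted correctly, so its recall is 1, which is the situation the abstract reports for that emotion after fine-tuning on EmoDB. A recall of 1 says nothing about false positives, which is one reason to read it alongside a threshold-independent measure such as AUC.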