What question did this study set out to answer?

The aim is to compare the performance of YAMNet and VGGish for recognizing emotions from audio signals.

February 14, 2026 | Open Access

Performance Analysis of YAMNet and VGGish Networks for Emotion Recognition from Audio Signals

Key Points

Conducted a comparative analysis of YAMNet and VGGish models for speech emotion recognition.
Used the RAVDESS and EmoDB datasets for experiments.
Employed traditional machine learning algorithms for classification of extracted features.
Analyzed both static feature extraction and fine-tuning setups of the models.
VGGish outperformed YAMNet in both feature extraction and fine-tuning scenarios.
Achieved the highest classification accuracy, 73.83%, using VGGish features with logistic regression on EmoDB.
Fine-tuning VGGish yielded an accuracy of 72.90% on EmoDB.
The highest AUC score, 0.9635, was obtained using logistic regression with VGGish features on EmoDB.
Fine-tuning both models on EmoDB reached a recall of 1 for the 'Sadness' emotion.
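
The key points report three different metrics: overall accuracy, AUC, and per-class recall. As a rough illustration of how they differ, the sketch below computes each on a toy set of labels. The data, the function names, and the one-vs-rest treatment of AUC are assumptions made for illustration; the summary does not state how the study computed its scores.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions for a class / all samples of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def one_vs_rest_auc(y_true, scores, positive):
    """AUC for one class against all others, using the pairwise-ranking form.

    scores[i] is the classifier's score for the `positive` class on sample i.
    The result is the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores (hypothetical, not taken from the study).
y_true = ["sadness", "sadness", "anger", "anger", "neutral", "neutral"]
y_pred = ["sadness", "sadness", "anger", "neutral", "neutral", "anger"]
sadness_scores = [0.9, 0.7, 0.2, 0.8, 0.1, 0.3]

recall = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
auc_sadness = one_vs_rest_auc(y_true, sadness_scores, "sadness")

print(recall)       # sadness is fully recovered, the other classes are not
print(accuracy)     # overall accuracy stays well below 1
print(auc_sadness)  # ranking quality for sadness vs. the rest
```

In this toy case every 'sadness' sample is recovered even though overall accuracy is only 4 out of 6, which is why a recall of 1 for a single class can sit alongside an accuracy in the low 70% range.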

Abstract

Understanding human emotions through vocal cues is key to developing emotionally intelligent systems, particularly in fields such as human-computer interaction, healthcare, and virtual assistants. However, accurately recognizing emotions from speech remains challenging because of variability in speaker traits and acoustic conditions and the subtle, often overlapping nature of emotional states. This study presents a comparative analysis of transfer learning methods for speech emotion recognition (SER) using pretrained audio neural networks. Specifically, YAMNet and VGGish were used both as static feature extractors and in a fine-tuning setup. The extracted embeddings were classified using traditional machine learning algorithms, including Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Random Forests (RF), and Logistic Regression (LR). Experiments were conducted on two widely used emotional speech datasets, RAVDESS and EmoDB. The results show that VGGish consistently outperforms YAMNet in both the feature extraction and fine-tuning scenarios. The highest classification accuracy (73.83%) was achieved using VGGish features with LR on EmoDB, and fine-tuning VGGish on EmoDB yielded a competitive accuracy of 72.90%. Class-specific analysis showed that the highest AUC score, 0.9635, was obtained with LR in the VGGish + EmoDB setting, while fine-tuning both YAMNet and VGGish on EmoDB reached a recall of 1 for the ‘Sadness’ emotion.
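
The static feature extraction setup described in the abstract can be sketched in miniature. The publicly released VGGish and YAMNet models emit one embedding per short audio frame (128 and 1024 dimensions, respectively), so some step has to turn a variable number of frames into one vector per utterance before a traditional classifier can be applied. The sketch below assumes mean pooling for that step and uses a small hand-written k-nearest-neighbors classifier. The pooling choice, the 3-dimensional toy vectors, and the function names are all assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def mean_pool(frames):
    """Average frame-level embeddings into a single utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def knn_predict(train_x, train_y, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy stand-ins for frame embeddings: 4 utterances, 2 frames each, 3 dimensions.
# Real embeddings would come from a pretrained network and be far wider.
train_frames = [
    [[0.9, 0.1, 0.0], [1.1, 0.1, 0.0]],
    [[1.0, 0.0, 0.2], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.1], [0.2, 0.8, 0.1]],
    [[0.1, 1.1, 0.0], [0.1, 0.9, 0.0]],
]
train_labels = ["sadness", "sadness", "anger", "anger"]

# One fixed-length vector per utterance, regardless of how many frames it had.
train_x = [mean_pool(frames) for frames in train_frames]

query = mean_pool([[0.8, 0.2, 0.1], [1.0, 0.0, 0.1]])
predicted = knn_predict(train_x, train_labels, query, k=3)
print(predicted)
```

In practice the classifier would more likely come from a library such as scikit-learn, and the same pooled vectors could be passed to an SVM, random forest, or logistic regression model in place of KNN.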
