What question did this study set out to answer?

This research aims to assess the effectiveness of pre-trained audio embeddings for classifying traditional Portuguese musical instruments.

May 11, 2026 · Open Access

Evaluating pre-trained audio embeddings for the classification of traditional Portuguese musical instruments

Key Points

The dataset comprises 1734 field recordings from the MPAGDP archive.
Session-based stratification was used to prevent data leakage between training and test data.
YAMNet, VGGish, and OpenL3 embeddings were compared on audio classification performance.
OpenL3 achieved a mean frame-level accuracy of 91.13% ± 3.31 and a Macro F1 score of 0.884 ± 0.031.
YAMNet excelled with 94.7% accuracy in environments with vocal interference.
OpenL3 showed resilience to noise with 78.5% accuracy at 5 dB SNR.
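
The noise stress test mentioned above can be illustrated with a minimal sketch of how white Gaussian noise is typically scaled to a target signal-to-noise ratio. This is not the study's code: the function names, the 440 Hz test tone, and the 16 kHz sample rate are assumptions chosen for illustration, and the study's actual noise-injection procedure may differ.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise at roughly `snr_db` dB SNR."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR (dB) = 10 * log10(P_signal / P_noise), solved here for P_noise.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical input: one second of a 440 Hz sine sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noisy = add_gaussian_noise(tone, snr_db=5.0)
print(f"measured SNR: {measured_snr_db(tone, noisy):.2f} dB")
```

The measured SNR only approximates the target because the noise is a finite random sample. In a stress test of this kind, the noisy waveform would be passed through the embedding model and classifier in place of the clean one.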

Abstract

The preservation of Intangible Cultural Heritage (ICH) faces challenges in managing large volumes of unstructured digital audio. Existing Music Information Retrieval (MIR) systems often underperform in this domain as they are optimised for commercial music. This paper evaluates feature extraction for seven traditional Portuguese instruments using 1734 field recordings from the “A Música Portuguesa a Gostar Dela Própria” (MPAGDP) archive. We implemented strict session-based stratification to prevent data leakage. The performance of YAMNet, VGGish, and OpenL3 was compared. Overall, results show that while OpenL3 provides the highest average timbral resolution (mean frame-level accuracy: 91.13% ± 3.31; Macro F1: 0.884 ± 0.031), a significant performance trade-off exists. Quantitative stress tests reveal OpenL3 is most resilient to synthetic Gaussian noise (78.5% accuracy at 5 dB SNR), making it ideal for high-precision archival digitisation. Conversely, YAMNet excels in uncontrolled fieldwork with vocal interference (94.7% accuracy), offering a more robust filter for non-musical semantic noise. Additionally, an architectural ablation study justifies a Dense MLP classifier, proving that simpler heads outperform sequential models (LSTM or Transformer) in these low-resource contexts. These findings offer a flexible technical roadmap: OpenL3 is recommended for institutional repositories requiring maximum resolution, while YAMNet is optimal for mobile devices or environments with high vocal overlap, providing a robust solution for preserving regional musical memory.

• A novel audio dataset of traditional Portuguese instruments stratified by recording session.
• Comparison of YAMNet, VGGish, and OpenL3 embeddings for heritage audio classification.
• OpenL3 achieves superior performance (mean 91.13% accuracy) in discerning similar string instruments across eight randomised seeds.
• A statistically robust split methodology validated through eight independent iterations that prevents data leakage, common in small folk music datasets.
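
One way to picture a session-based split is the sketch below, which assigns whole recording sessions to either the training or the test set, so that clips from the same session never appear on both sides. It is an illustrative assumption, not the authors' procedure: the record layout `(clip_id, session_id, label)`, the function name `session_split`, the placeholder labels, and the 20% test fraction are all hypothetical, and the sketch assumes each session contains a single instrument.

```python
import random
from collections import defaultdict


def session_split(records, test_fraction=0.2, seed=0):
    """Split (clip_id, session_id, label) records so no session spans both sets."""
    sessions_by_label = defaultdict(set)
    for _, session_id, label in records:
        sessions_by_label[label].add(session_id)

    rng = random.Random(seed)
    test_sessions = set()
    # Stratify by label: hold out a share of each instrument's sessions.
    for label in sorted(sessions_by_label):
        sessions = sorted(sessions_by_label[label])
        rng.shuffle(sessions)
        if len(sessions) < 2:
            continue  # a single session stays in training so the class is still learned
        n_test = min(len(sessions) - 1, max(1, round(len(sessions) * test_fraction)))
        test_sessions.update(sessions[:n_test])

    train = [r for r in records if r[1] not in test_sessions]
    test = [r for r in records if r[1] in test_sessions]
    return train, test


# Hypothetical data: 2 placeholder instruments x 5 sessions x 3 clips per session.
records = [
    (f"{label}_s{s}_c{c}", f"{label}_s{s}", label)
    for label in ("instrument_a", "instrument_b")
    for s in range(5)
    for c in range(3)
]
train, test = session_split(records, test_fraction=0.2, seed=0)
print(len(train), len(test))
```

Repeating the split with different `seed` values gives independent iterations over which accuracy and Macro F1 can be averaged, in the spirit of the eight randomised seeds reported above.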
