What question did this study set out to answer?

This research quantifies the impact of audio quality metrics on speaker verification system performance under various degradation conditions.

January 14, 2026 · Open Access

Quantifying the Relationship Between Speech Quality Metrics and Biometric Speaker Recognition Performance Under Acoustic Degradation

Key Points

This research quantifies how audio quality, as captured by objective metrics, affects speaker verification performance under various degradation conditions.
Three SSL-based speaker verification systems were analyzed: WavLM, Wav2Vec2, and HuBERT.
The systems were evaluated under 21 degradation conditions, including noise contamination, reverberation, and codec compression.
Both objective audio quality metrics and speaker verification performance metrics were measured.
Audio quality metrics explained up to 80% of the variance in minDCF for HuBERT and 78% for WavLM, but only 35% for Wav2Vec2.
PESQ was the strongest single predictor for WavLM and HuBERT; shimmer had the highest single-metric correlation for Wav2Vec2.
WavLM and HuBERT showed more predictable quality-performance relationships than Wav2Vec2.

Abstract

Self-supervised learning (SSL) models have achieved remarkable success in speaker verification tasks, yet their robustness to real-world audio degradation remains insufficiently characterized. This study presents a comprehensive analysis of how audio quality degradation affects three prominent SSL-based speaker verification systems (WavLM, Wav2Vec2, and HuBERT) across three diverse datasets: TIMIT, CHiME-6, and Common Voice. We systematically applied 21 degradation conditions spanning noise contamination (SNR levels from 0 to 20 dB), reverberation (RT60 from 0.3 to 1.0 s), and codec compression (various bit rates), then measured both objective audio quality metrics (PESQ, STOI, SNR, SegSNR, fwSNRseg, jitter, shimmer, HNR) and speaker verification performance metrics (EER, AUC-ROC, d-prime, minDCF). At the condition level, multiple regression with all eight quality metrics explained up to 80% of the variance in minDCF for HuBERT and 78% for WavLM, but only 35% for Wav2Vec2; EER predictability was lower (69%, 67%, and 28%, respectively). PESQ was the strongest single predictor for WavLM and HuBERT, while shimmer showed the highest single-metric correlation for Wav2Vec2; fwSNRseg yielded the top single-metric R² for WavLM, and PESQ for HuBERT and Wav2Vec2 (with much smaller gains for Wav2Vec2). WavLM and HuBERT exhibited more predictable quality-performance relationships than Wav2Vec2. These findings establish quantitative relationships between measurable audio quality and speaker verification accuracy at the condition level, though substantial within-condition variability limits utterance-level prediction accuracy.
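
To make the "variance explained" figures concrete, the following is a minimal sketch of a condition-level multiple regression, assuming ordinary least squares with an intercept and R² computed as 1 − SS_res/SS_tot. The abstract does not specify the fitting procedure, so this is an illustration rather than the authors' pipeline. The function name `r_squared` and all numbers are hypothetical: the data are synthetic placeholders shaped like 21 conditions by 8 quality metrics, not values from the study.

```python
import numpy as np


def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the model includes an intercept term.
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


# Synthetic placeholder data (NOT from the study): one row per degradation
# condition, one column per quality metric, plus a made-up minDCF target.
rng = np.random.default_rng(0)
n_conditions, n_metrics = 21, 8
quality = rng.normal(size=(n_conditions, n_metrics))
true_weights = rng.normal(size=n_metrics)
min_dcf = quality @ true_weights + rng.normal(scale=0.5, size=n_conditions)

# R^2 using all eight metrics together, and using each metric on its own.
r2_all = r_squared(quality, min_dcf)
r2_single = [r_squared(quality[:, [j]], min_dcf) for j in range(n_metrics)]

print(f"R^2, all metrics: {r2_all:.3f}")
print(f"Best single-metric R^2: {max(r2_single):.3f}")
```

With only 21 conditions and eight predictors, in-sample R² tends to be optimistic, so a reader reproducing this kind of analysis may also want an adjusted R² or a cross-validated estimate.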
