What type of study is this?

This is a quantitative study.

September 30, 2025 | Open Access

MMFA: Masked Multi-Layer Feature Aggregation for Speaker Verification Using WavLM

Key Points

MMFA improves speaker verification by emphasizing speaker-relevant frames and suppressing irrelevant ones such as silence or noise.
On VoxCeleb1, MMFA yields consistent gains over strong baselines in both equal error rate (EER) and minimum detection cost function (minDCF).
The method applies frame-wise attention independently within each SSL layer, then combines the layers with learnable layer-wise weights.
Attention-map analysis reveals distinct per-layer frame selection patterns, supporting the claim that frame relevance is layer dependent.
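
For readers unfamiliar with the EER metric named above, the following is a minimal sketch of how it is commonly approximated from trial scores. It is not taken from the paper: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores in the usage note are all illustrative assumptions.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    FRR = fraction of same-speaker (target) trials rejected.
    FAR = fraction of different-speaker (non-target) trials accepted.
    The EER is read off where FAR and FRR are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Average the two rates at the closest crossing point.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With the made-up scores `equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])`, the two error rates meet at a threshold of 0.6, where one of four target trials is rejected and one of four non-target trials is accepted. Production toolkits usually interpolate between thresholds rather than picking the nearest one.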

Abstract

Self-supervised learning (SSL) models such as WavLM and wav2vec 2.0 have become front ends for speaker verification (SV), providing multi-layer speech representations without labeled data. Lower layers capture acoustic details, whereas higher layers encode phonetic and contextual information. The spread of wearables such as smartwatches, earbuds and AR/VR headsets has increased demand for privacy-preserving on-device SV that runs under tight compute and power budgets and remains robust to short, noisy utterances. Conventional SV systems typically use the final layer or a weighted aggregation of layers with a single temporal attention step, implicitly assuming frame importance is shared across layers and underusing the hierarchical diversity of SSL embeddings. We argue that frame relevance is layer dependent: each layer highlights different aspects of speech, and the frames most critical for speaker identity differ across layers. We therefore propose Masked Multi-layer Feature Aggregation (MMFA), which applies frame-wise attention independently within each layer before learnable layer-wise weighting. This emphasizes speaker-relevant frames and suppresses irrelevant ones such as silence or noise while integrating complementary information across layers. On VoxCeleb1, MMFA yields consistent gains over strong baselines in both EER and minDCF, and attention-map analyses reveal distinct per-layer selection patterns, validating MMFA for robust SV.
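
The aggregation described in the abstract (frame-wise attention within each layer, then learnable layer-wise weighting) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how frames are scored or how masking is applied, so the dot-product scoring against a per-layer query vector, the boolean frame mask, and all function and parameter names here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats (-inf entries get weight 0)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attentive_pool(frames, query, mask):
    """Pool one layer's frames (T x D) into a single D-dimensional vector.

    Assumption: each frame is scored by a dot product with a per-layer
    query vector; frames with mask == False (e.g. padding or silence)
    receive a score of -inf and therefore zero attention weight.
    """
    if not any(mask):
        raise ValueError("at least one frame must be unmasked")
    scores = [
        sum(f * q for f, q in zip(frame, query)) if keep else float("-inf")
        for frame, keep in zip(frames, mask)
    ]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

def mmfa_aggregate(layers, queries, layer_logits, mask):
    """Attend over frames independently in each layer, then mix the layers.

    layers:       list of L layer outputs, each a T x D list of frames
    queries:      list of L per-layer query vectors (hypothetical parameters)
    layer_logits: list of L scalars, softmax-normalised into layer weights
    """
    pooled = [
        masked_attentive_pool(frames, query, mask)[0]
        for frames, query in zip(layers, queries)
    ]
    layer_weights = softmax(layer_logits)
    dim = len(pooled[0])
    return [sum(w * p[d] for w, p in zip(layer_weights, pooled)) for d in range(dim)]
```

Because each layer has its own query, the attention weights returned by `masked_attentive_pool` can differ from layer to layer for the same utterance, which is the layer-dependent frame relevance the abstract argues for. In a real system the queries and layer logits would be trained jointly with the speaker classifier on top of WavLM features.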
