Self-supervised learning (SSL) models such as WavLM and wav2vec 2.0 have become front ends for speaker verification (SV), providing multi-layer speech representations without labeled data. Lower layers capture acoustic details, whereas higher layers encode phonetic and contextual information. The spread of wearables such as smartwatches, earbuds and AR/VR headsets has increased demand for privacy-preserving on-device SV that runs under tight compute and power budgets and remains robust to short, noisy utterances. Conventional SV systems typically use the final layer or a weighted aggregation of layers with a single temporal attention step, implicitly assuming frame importance is shared across layers and underusing the hierarchical diversity of SSL embeddings. We argue that frame relevance is layer dependent: each layer highlights different aspects of speech, and the frames most critical for speaker identity differ across layers. We therefore propose Masked Multi-layer Feature Aggregation (MMFA), which applies frame-wise attention independently within each layer before learnable layer-wise weighting. This emphasizes speaker-relevant frames and suppresses irrelevant ones such as silence or noise while integrating complementary information across layers. On VoxCeleb1, MMFA yields consistent gains over strong baselines in both EER and minDCF, and attention-map analyses reveal distinct per-layer selection patterns, validating MMFA for robust SV.
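The aggregation order described above (frame-wise attention within each layer, then learnable layer-wise weighting) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `per_layer_attentive_pooling`, the single scoring vector per layer (`attn_vectors`), the softmax-normalized `layer_logits`, and the reading of "masked" as excluding padded frames are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def per_layer_attentive_pooling(hidden, frame_mask, attn_vectors, layer_logits):
    """Pool SSL hidden states with per-layer frame attention (illustrative only).

    hidden:       (L, T, D) hidden states from L layers, T frames, D dims
    frame_mask:   (T,) bool, True for valid frames (assumed: padding mask)
    attn_vectors: (L, D) one hypothetical scoring vector per layer
    layer_logits: (L,) hypothetical learnable layer-weight logits
    """
    # Score every frame separately in every layer: scores[l, t] = <hidden[l, t], attn_vectors[l]>.
    scores = np.einsum("ltd,ld->lt", hidden, attn_vectors)
    # Exclude masked frames before normalizing (requires at least one valid frame).
    scores = np.where(frame_mask[None, :], scores, -np.inf)
    # Each layer gets its own attention distribution over frames.
    frame_attn = softmax(scores, axis=1)                    # (L, T)
    # Attention-weighted sum over frames, one pooled vector per layer.
    pooled = np.einsum("lt,ltd->ld", frame_attn, hidden)   # (L, D)
    # Layer-wise weighting is applied after the per-layer pooling.
    layer_weights = softmax(layer_logits, axis=0)           # (L,)
    embedding = layer_weights @ pooled                      # (D,)
    return embedding, frame_attn, layer_weights

# Toy usage with random features standing in for SSL layer outputs.
rng = np.random.default_rng(0)
L, T, D = 4, 6, 8
hidden = rng.standard_normal((L, T, D))
frame_mask = np.array([True, True, True, True, False, False])
emb, attn, beta = per_layer_attentive_pooling(
    hidden, frame_mask, rng.standard_normal((L, D)), np.zeros(L)
)
```

In this sketch each row of `attn` is a separate distribution over frames, which is the point of contrast with a single temporal attention step shared across layers: two layers are free to emphasize different frames of the same utterance.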
Lee et al. (Mon,) studied this question.