Recently, self-supervised learning (SSL) has emerged as a promising strategy for constructing speaker verification (SV) systems, effectively mitigating the cost and privacy issues associated with the labeling process. Most SSL-based SV systems focus on utterance-level features, potentially overlooking the inherent inter-frame structure of speech. To bridge this gap, we propose relational mask prediction (RMP), a novel loss function that encourages models to learn the relationships between frames. Additionally, we introduce a block aggregation Transformer (BA-Transformer) to enrich frame-level features. Models were trained without labels on the VoxCeleb2 development set and comprehensively evaluated on various test sets. Experimental results demonstrate that the proposed framework outperforms recent SSL-based SV systems, achieving an average performance improvement of 22.39% over the baseline across the entire evaluation dataset.
Kim et al. studied this question.