We present AS-pVAD, a lightweight neural network with an attentive score loss for frame-wise personalized voice activity detection (pVAD). Instead of relying on an external speaker embedding extractor with a large number of parameters, AS-pVAD uses a lightweight internal model to extract the target speaker embedding. We propose a novel attentive score loss constraint that exploits these embedding cues for pVAD better than conventional embedding concatenation. Joint training with a regular VAD further improves AS-pVAD: it identifies the target speaker when enrollment is available and functions as a regular VAD when no enrollment is available. Experimental results show that AS-pVAD achieves an average AUCROC above 0.9 in two-speaker scenarios under various noisy and reverberant conditions. We also publicly release our test set to facilitate research in this area.
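For readers unfamiliar with the AUCROC figure quoted above, the following is a minimal sketch of how a frame-wise AUCROC could be computed from per-frame scores. It is not the authors' evaluation code. The function name `frame_auc_roc` and the binary labeling (1 for a target-speaker speech frame, 0 for any other frame) are assumptions made for illustration; the abstract does not specify how frames are labeled or how scores are averaged across conditions.

```python
def frame_auc_roc(labels, scores):
    """Area under the ROC curve for per-frame binary labels and scores.

    Uses the rank-sum (Mann-Whitney U) formulation: the AUC equals the
    probability that a randomly chosen positive frame scores higher than
    a randomly chosen negative frame, with ties counted as one half.

    Assumed convention (hypothetical, not from the paper):
    label 1 = target-speaker speech frame, label 0 = any other frame.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    # Sort frame indices by score, then assign 1-based ranks,
    # giving tied scores the average of the ranks they span.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative frame")

    # U statistic for the positive class, normalized to [0, 1].
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy example with four frames; values are illustrative only.
    print(frame_auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 1.0 means every target-speaker frame outscores every other frame, while 0.5 corresponds to chance-level ranking.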
Liu et al. studied this question.