AS-pVAD: A Frame-Wise Personalized Voice Activity Detection Network with Attentive Score Loss

Abstract

We present AS-pVAD, a lightweight neural network trained with an attentive score loss for frame-wise personalized voice activity detection (pVAD). Instead of relying on an external speaker embedding extractor with a large number of parameters, AS-pVAD uses a lightweight internal model to extract the target speaker embedding. We propose a novel attentive score loss constraint that exploits these embedding cues for pVAD more effectively than conventional embedding concatenation. Joint training with a regular VAD further improves AS-pVAD: it identifies the target speaker when enrollment is available and functions as a regular VAD when it is not. Experimental results show that AS-pVAD achieves an average AUC-ROC above 0.9 in two-speaker talking scenarios under various noisy and reverberant environments. Our test set is publicly released to the community to facilitate research in this area.
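
For readers unfamiliar with the reported metric, the sketch below shows one common way to compute AUC-ROC from frame-wise detection scores, using the rank-based (Mann-Whitney) formulation. It is a minimal illustration and not the authors' evaluation code: the function name `auc_roc` and the toy labels and scores are assumptions introduced here, and the paper's exact evaluation protocol is not described in the abstract.

```python
def auc_roc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: per-frame ground truth (1 = target speaker active, 0 = otherwise).
    scores: per-frame detector outputs (higher = more likely target speech).
    Both names are illustrative, not taken from the paper.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based average ranks so that tied scores are handled correctly.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative frame")

    # U counts (positive, negative) pairs where the positive frame scores higher;
    # ties contribute one half through the average ranks.
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy example with four frames (made-up values).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_roc(toy_labels, toy_scores))  # 0.75
```

A value of 1.0 means every target-speaker frame is scored above every other frame, while 0.5 corresponds to chance-level ranking.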
