This study presents HGRN2-based Flexible Dynamic Encoder Personal VAD (FDE-HGRN2), a recurrent framework for personal voice activity detection (PVAD). Building on the original LSTM-based FDE-RNN backbone, we replace all recurrent modules with the recently introduced HGRN2 gated linear RNN and adopt a cosine-annealing learning rate schedule to improve both detection accuracy and efficiency. HGRN2 uses gated linear recurrence with non-parametric state expansion, enlarging the recurrent state without increasing the number of trainable parameters and enabling more expressive long-range temporal modeling than conventional LSTMs. We evaluate FDE-HGRN2 on a LibriSpeech-derived PVAD benchmark, where multi-speaker mixtures are constructed by concatenating one to three speakers per utterance and randomly designating a target speaker, following established PVAD data construction practices to ensure direct comparability with prior work. The system uses 40-dimensional Mel-filterbank features as acoustic inputs and conditions the detector on 256-dimensional d-vector embeddings extracted from a pretrained speaker verification network. Experimental results show that FDE-HGRN2 consistently outperforms the original FDE-RNN baseline and several state-of-the-art PVAD models in terms of mean Average Precision and frame-level accuracy, while reducing the parameter count of the recurrent backbone by roughly 15% and yielding substantially smaller models than many competing systems. These findings indicate that HGRN2 provides a more temporally expressive and parameter-efficient alternative to LSTM for PVAD, offering a favorable accuracy–efficiency trade-off for real-world, deployment-oriented personalized speech interfaces.
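The mixture construction described above (concatenating one to three speakers per utterance and randomly designating a target) can be illustrated with a minimal sketch. The function name `build_mixture`, the dictionary layout, and the binary target/non-target labels are assumptions made for illustration only. The sketch also assumes the target is always one of the concatenated speakers and omits non-speech labeling, which would require alignments not described here. It is not the authors' pipeline.

```python
import random


def build_mixture(utterances_by_speaker, rng, max_speakers=3):
    """Concatenate utterances from 1..max_speakers speakers and pick a target.

    utterances_by_speaker: dict mapping a speaker id to a list of utterances,
        each utterance being a list of per-frame feature vectors
        (e.g. 40-dimensional Mel-filterbank frames).
    Returns (frames, labels, target); labels[i] is 1 when frame i belongs to
    the target speaker and 0 otherwise (a simplification, see the lead-in).
    """
    n_speakers = rng.randint(1, max_speakers)            # one to three speakers
    speakers = rng.sample(sorted(utterances_by_speaker), n_speakers)
    target = rng.choice(speakers)                        # randomly designated target

    frames, labels = [], []
    for spk in speakers:
        utt = rng.choice(utterances_by_speaker[spk])     # one utterance per speaker
        frames.extend(utt)
        labels.extend([1 if spk == target else 0] * len(utt))
    return frames, labels, target
```

In a full system, the frame sequence would be paired with the target speaker's 256-dimensional d-vector before being passed to the detector; that conditioning step is outside this sketch.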
Electronics
National Taiwan Normal University
National Chi Nan University
Wang et al. (Wed,) studied this question.