What question did this study set out to answer?

The aim is to develop a lightweight framework for personal voice activity detection that improves detection accuracy while maintaining efficiency.

April 10, 2026 | Open Access

HGRN2-Based Personal Voice Activity Detection: A Lightweight Recurrent Framework for Inference and Training

Key Points

Introduces FDE-HGRN2, a recurrent framework that replaces the LSTM modules of the FDE-RNN backbone with the HGRN2 gated linear RNN.
Trains with a cosine-annealing learning rate schedule.
Evaluates on a LibriSpeech-derived PVAD benchmark built by concatenating one to three speakers per utterance and randomly designating a target speaker.
Uses 40-dimensional Mel-filterbank features as acoustic inputs and 256-dimensional d-vector embeddings as speaker conditioning.
Outperforms the original FDE-RNN baseline and several state-of-the-art PVAD models in mean Average Precision and frame-level accuracy.
Reduces the parameter count of the recurrent backbone by roughly 15%, yielding smaller models than many competing systems.
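
For readers unfamiliar with the cosine-annealing schedule named in the key points, the following is a minimal sketch of the standard single-cycle form, lr(t) = lr_min + 0.5 (lr_max - lr_min)(1 + cos(pi t / T)). The function name and the values of `lr_max`, `lr_min`, and `total_steps` are illustrative assumptions; this page does not report the hyperparameters the authors used, nor whether they applied warm-up or restarts.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Learning rate at `step` for a single-cycle cosine-annealing schedule.

    lr_max, lr_min and total_steps are placeholder values, not the paper's.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay on the schedule's ends.
    step = min(max(step, 0), total_steps)
    # The cosine term decays smoothly from 1 (at step 0) to 0 (at total_steps).
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term


if __name__ == "__main__":
    for s in (0, 25, 50, 75, 100):
        print(s, cosine_annealing_lr(s, total_steps=100))
```

The rate starts at `lr_max`, passes through the midpoint of the two bounds halfway through training, and ends at `lr_min`.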

Abstract

This study presents HGRN2-based Flexible Dynamic Encoder Personal VAD (FDE-HGRN2), a recurrent framework for personal voice activity detection (PVAD). Building on the original LSTM-based FDE-RNN backbone, we replace all recurrent modules with the recently introduced HGRN2 gated linear RNN and adopt a cosine-annealing learning rate schedule to improve both detection accuracy and efficiency. HGRN2 uses gated linear recurrence with non-parametric state expansion, enlarging the recurrent state without increasing the number of trainable parameters and enabling more expressive long-range temporal modeling than conventional LSTMs. We evaluate FDE-HGRN2 on a LibriSpeech-derived PVAD benchmark, where multi-speaker mixtures are constructed by concatenating one to three speakers per utterance and randomly designating a target speaker, following established PVAD data construction practices to ensure direct comparability with prior work. The system uses 40-dimensional Mel-filterbank features as acoustic inputs and conditions the detector on 256-dimensional d-vector embeddings extracted from a pretrained speaker verification network. Experimental results show that FDE-HGRN2 consistently outperforms the original FDE-RNN baseline and several state-of-the-art PVAD models in terms of mean Average Precision and frame-level accuracy, while reducing the parameter count of the recurrent backbone by roughly 15% and yielding substantially smaller models than many competing systems. These findings indicate that HGRN2 provides a more temporally expressive and parameter-efficient alternative to LSTM for PVAD, offering a favorable accuracy–efficiency trade-off for real-world, deployment-oriented personalized speech interfaces.
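
The benchmark construction described in the abstract (concatenate one to three speakers, then randomly designate a target) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the input layout (a pool of per-frame speech flags per speaker), the three label ids, the helper name `build_pvad_example`, and the choice to always draw the target from the speakers present in the mixture are all hypothetical.

```python
import random

# Label ids are an assumption; the abstract does not name the frame classes.
NON_SPEECH, NON_TARGET_SPEECH, TARGET_SPEECH = 0, 1, 2


def build_pvad_example(pool, rng, max_speakers=3):
    """Concatenate utterances from 1..max_speakers speakers and label each frame.

    `pool` maps a speaker id to a list of utterances; each utterance is a list
    of per-frame booleans (True = speech). This layout is hypothetical.
    """
    n_speakers = rng.randint(1, min(max_speakers, len(pool)))
    # Sample distinct speakers; sorting keeps the draw reproducible for a seed.
    speakers = rng.sample(sorted(pool), n_speakers)
    # Assumption: the target is always one of the speakers in the mixture.
    target = rng.choice(speakers)

    labels = []
    for spk in speakers:
        utterance = rng.choice(pool[spk])
        speech_label = TARGET_SPEECH if spk == target else NON_TARGET_SPEECH
        # Silent frames are non-speech regardless of who the speaker is.
        labels.extend(speech_label if is_speech else NON_SPEECH
                      for is_speech in utterance)
    return {"speakers": speakers, "target": target, "labels": labels}


if __name__ == "__main__":
    toy_pool = {
        "spk_a": [[True, True, False]],
        "spk_b": [[False, True, True, True]],
        "spk_c": [[True, False]],
    }
    print(build_pvad_example(toy_pool, random.Random(0)))
```

In a full system the same concatenation would be applied to the 40-dimensional Mel-filterbank frames, and the target's 256-dimensional d-vector would be supplied to the detector as the conditioning input.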

Journals

Electronics

Institutions

National Taiwan Normal University

National Chi Nan University
