Voice Activity Detection (VAD) and Personal Voice Activity Detection (PVAD) are fundamental components in modern voice-based human–machine interaction systems. While VAD distinguishes speech from non-speech segments, PVAD further identifies whether the detected speech belongs to a specific target speaker, enabling more robust performance in multi-speaker environments. Recently, the Flexible Dynamic Encoder RNN (FDE-RNN) has demonstrated state-of-the-art performance on PVAD tasks by leveraging a detachable Personalization module (P-module) built upon a Dynamic Encoder RNN backbone. However, the Long Short-Term Memory (LSTM) networks employed throughout FDE-RNN are inherently sequential, which prevents parallelization across time steps, and their fixed-size hidden state may restrict representational capacity for fine-grained speaker discrimination. In this paper, we propose FDE-Mamba, which replaces all three LSTM components in FDE-RNN (the Prediction RNN, the Encoder RNN, and the P-module temporal model) with independent Mamba blocks, each equipped with a selective state space mechanism and an expansion layer for enriched feature representation. The proposed architecture retains the weighted residual connection, FiLM-based speaker embedding fusion, and parallel training strategy of the original FDE-RNN without modification. Experimental results on the LibriSpeech corpus show that FDE-Mamba achieves a PVAD mAP of 0.9605, a 1.97% relative improvement over the reproduced FDE-RNN baseline (0.9419), and raises accuracy from 86.85% to 89.87%. It also reduces the real-time factor by 3.16×, owing to the memory-efficient linear recurrence of the Mamba selective scan during inference, while remaining parallelizable across time steps during training.
Ablation studies further confirm that both the D skip connection and the expansion layer within each Mamba block contribute meaningfully to the observed performance gains, validating the effectiveness of each architectural design choice. These results suggest that Mamba is a compelling alternative to LSTM for temporal modeling in PVAD systems, and that the proposed integration provides a design blueprint for future selective SSM applications in gated PVAD architectures.
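For readers unfamiliar with the selective scan and the D skip connection mentioned above, the following is a minimal single-channel sketch in plain Python. It is not the FDE-Mamba implementation: the function name `selective_scan`, the scalar weights `w_delta`, `w_B`, and `w_C`, and the simplified discretization are illustrative assumptions, and the expansion layer, gating, and convolution of a full Mamba block are omitted.

```python
import math


def softplus(v):
    # Numerically stable softplus; keeps the step size positive.
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def selective_scan(x, A, w_delta, w_B, w_C, D):
    """Single-channel selective SSM recurrence with a D skip term.

    x       : list of T floats (one input channel over time)
    A       : list of N negative floats (diagonal state matrix)
    w_delta : hypothetical weight making the step size depend on x_t
    w_B, w_C: hypothetical length-N weights making B and C depend on x_t
    D       : scalar skip weight applied directly to x_t
    """
    n_state = len(A)
    h = [0.0] * n_state          # fixed-size state, constant memory in T
    y = []
    for x_t in x:
        delta = softplus(w_delta * x_t)        # input-dependent step size
        B_t = [w * x_t for w in w_B]           # input-dependent input matrix
        C_t = [w * x_t for w in w_C]           # input-dependent output matrix
        for n in range(n_state):
            a_bar = math.exp(delta * A[n])     # discretized state transition
            b_bar = delta * B_t[n]             # simplified discretization of B
            h[n] = a_bar * h[n] + b_bar * x_t  # linear recurrence
        # State readout plus the D skip connection
        y.append(sum(c * s for c, s in zip(C_t, h)) + D * x_t)
    return y


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0]
    A = [-1.0, -0.5]
    with_skip = selective_scan(x, A, 0.3, [0.2, 0.1], [0.4, 0.7], D=1.0)
    without_skip = selective_scan(x, A, 0.3, [0.2, 0.1], [0.4, 0.7], D=0.0)
    print(with_skip)
    print(without_skip)
```

In this sketch, setting `D=0.0` mimics removing the skip connection: the two outputs differ by exactly `D * x_t` at each step, so the skip term gives the block a direct path from input to output that bypasses the recurrent state.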
Chiu et al. (Sat,) studied this question.