What question did this study set out to answer?

This research aims to develop FDE-Mamba, a new architecture for improving personal voice activity detection (PVAD) performance.

May 15, 2026 · Open Access

FDE-Mamba: Selective State Space Modeling for Personal Voice Activity Detection

Key Points

This research aims to develop FDE-Mamba, a new architecture for improving personal voice activity detection (PVAD) performance.
Replaced all three LSTM components of FDE-RNN (the Prediction RNN, the Encoder RNN, and the P-module temporal model) with independent Mamba blocks, each with a selective state space mechanism and an expansion layer.
Retained FDE-RNN's weighted residual connection, FiLM-based speaker embedding fusion, and parallel training strategy without modification.
Evaluated the model on the LibriSpeech corpus and ran ablation studies on the D skip connection and the expansion layer within each Mamba block.
Achieved a PVAD mAP of 0.9605, a 1.97% relative improvement over the reproduced FDE-RNN baseline of 0.9419.
Increased classification accuracy from 86.85% to 89.87%.
Reduced the real-time factor by 3.16×, which the authors attribute to the memory-efficient linear recurrences of the Mamba selective scan during inference.
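
The headline numbers can be checked with a few lines of arithmetic. The sketch below assumes that the 1.97% figure is a relative improvement in mAP (it matches the reported values under that reading) and that the accuracy gain is best expressed in percentage points; the function name `relative_improvement` is illustrative only.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Values reported for the reproduced FDE-RNN baseline and for FDE-Mamba.
map_gain = relative_improvement(0.9419, 0.9605)  # relative mAP gain, percent
acc_gain_points = 89.87 - 86.85                  # accuracy gain, percentage points

print(f"mAP relative improvement: {map_gain:.2f}%")
print(f"Accuracy gain: {acc_gain_points:.2f} percentage points")
```

The absolute mAP difference is 0.0186, so the quoted percentage only makes sense as a gain relative to the baseline.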

Abstract

Voice Activity Detection (VAD) and Personal Voice Activity Detection (PVAD) are fundamental components in modern voice-based human–machine interaction systems. While VAD distinguishes speech from non-speech segments, PVAD further identifies whether the detected speech belongs to a specific target speaker, enabling more robust performance in multi-speaker environments. Recently, the Flexible Dynamic Encoder RNN (FDE-RNN) has demonstrated state-of-the-art performance on PVAD tasks by leveraging a detachable Personalization module (P-module) built upon a Dynamic Encoder RNN backbone. However, the Long Short-Term Memory (LSTM) networks employed throughout FDE-RNN process each sequence step by step, which prevents parallelization across time steps, and their fixed-size hidden state may restrict representational capacity for fine-grained speaker discrimination. In this paper, we propose FDE-Mamba, which replaces all three LSTM components in FDE-RNN (the Prediction RNN, the Encoder RNN, and the P-module temporal model) with independent Mamba blocks, each equipped with a selective state space mechanism and an expansion layer for enriched feature representation. The proposed architecture retains the weighted residual connection, FiLM-based speaker embedding fusion, and parallel training strategy of the original FDE-RNN without modification. Experimental results on the LibriSpeech corpus show that FDE-Mamba achieves a PVAD mAP of 0.9605, a 1.97% relative improvement over the reproduced FDE-RNN baseline (0.9419), and raises accuracy from 86.85% to 89.87%. It also reduces the real-time factor by 3.16×, owing to the memory-efficient linear recurrences of the Mamba selective scan during inference; the selective scan is additionally parallelizable during training.
Ablation studies further confirm that both the D skip connection and the expansion layer within each Mamba block contribute meaningfully to the observed performance gains, validating the effectiveness of each architectural design choice. These results suggest that Mamba is a compelling alternative to LSTM for temporal modeling in PVAD systems, and that the proposed integration provides a design blueprint for future selective SSM applications in gated PVAD architectures.
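
To make the selective state space mechanism and the D skip connection more concrete, here is a minimal single-channel sketch in plain Python. It is not the authors' implementation. It assumes a diagonal state matrix `A` with negative entries, a softplus step size, and a commonly used simplified discretization (`exp(delta * A)` for the state, `delta * B` for the input). The names `selective_scan`, `w_delta`, `w_B`, and `w_C` are hypothetical, and the expansion layer, gating, and convolution of a full Mamba block are omitted.

```python
import math


def softplus(z: float) -> float:
    """Smooth positive function used here for the step size."""
    return math.log1p(math.exp(z))


def selective_scan(x, A, w_delta, w_B, w_C, D):
    """Run a simplified selective SSM over one input channel.

    x        : list of floats, the input sequence
    A        : list of N negative floats (diagonal state matrix)
    w_delta  : float, makes the step size depend on the input
    w_B, w_C : lists of N floats, make B_t and C_t depend on the input
    D        : float, weight of the skip connection from input to output
    """
    h = [0.0] * len(A)  # fixed-size state, updated once per time step
    y = []
    for x_t in x:
        delta = softplus(w_delta * x_t)   # input-dependent step size
        B_t = [w * x_t for w in w_B]      # input-dependent input projection
        C_t = [w * x_t for w in w_C]      # input-dependent output projection
        # Linear recurrence: decay the old state, then add the new input.
        h = [math.exp(delta * a) * h_n + delta * b * x_t
             for a, h_n, b in zip(A, h, B_t)]
        # Read out the state and add the D skip path.
        y.append(sum(c * h_n for c, h_n in zip(C_t, h)) + D * x_t)
    return y


# Illustrative call with made-up parameters (two state dimensions).
outputs = selective_scan(
    x=[1.0, -2.0, 3.0],
    A=[-1.0, -0.5],
    w_delta=0.3,
    w_B=[0.2, -0.1],
    w_C=[0.4, 0.7],
    D=0.5,
)
print(outputs)
```

Because `delta`, `B_t`, and `C_t` are computed from the current input, the recurrence can decide per time step how much of the past state to keep, which is the "selective" part. The loop touches a fixed-size state once per step, so inference cost grows linearly with sequence length. Setting `D` to zero removes the skip path, which corresponds in spirit to one of the ablations described above.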
