Speech enhancement (SE) aims to recover clean speech from noisy speech while preserving intelligibility, speaker identity, and runtime efficiency. Existing language-model (LM)-based methods may lose fine acoustic details due to discretization, whereas diffusion models often require many iterative denoising steps. This study proposes an efficient speech enhancement framework based on flow matching and a gated bidirectional Mamba2 backbone. The model predicts a continuous velocity field in the Mel-spectrogram domain and introduces a DiMamba block that captures past and future context through shared-weight bidirectional state-space modeling with adaptive gating. Experiments on the DNS Challenge test set and additional VoiceBank test data show that the proposed method achieves strong perceptual quality and speaker preservation while substantially reducing inference cost. The model reaches a real-time factor of 0.31, more than five times faster than diffusion baselines, and achieves a word error rate of 4.7% and a quality mean opinion score of 3.58. These results indicate that flow matching combined with gated bidirectional Mamba2 provides an effective quality–efficiency trade-off for offline speech enhancement.
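As a rough illustration of why predicting a continuous velocity field can need far fewer steps than iterative denoising, the following minimal sketch integrates a velocity field from t = 0 to t = 1 with a few fixed Euler steps. It is not the authors' implementation. The names `euler_sample` and `make_oracle_velocity`, the three-value "frames", the step count, and the choice to start from the noisy frame are all assumptions made for illustration, and a closed-form straight-line velocity stands in for the trained network.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # always < 1, so the oracle below never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a trained network.

    Returns the exact velocity of the straight-line path that reaches
    `target` at t=1; a real model would predict this from the input features.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Toy 3-bin "Mel frames" (made-up numbers, not real data).
noisy_frame = [0.9, -0.4, 0.2]
clean_frame = [0.5, 0.1, -0.3]

enhanced = euler_sample(make_oracle_velocity(clean_frame), noisy_frame, num_steps=4)
print(enhanced)
```

With an exact straight-line velocity, a handful of Euler steps is enough to land on the target. A learned field is only approximate, so the number of steps needed in practice depends on the model and is not specified here.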
Yuan et al. studied this question.