Diffusion-based generative models have recently gained attention in speech enhancement (SE), providing an alternative to conventional supervised methods. These models transform clean speech training samples into Gaussian noise, usually centered on the noisy speech, and then learn a parameterized model that reverses this process conditioned on the noisy speech. Unlike supervised methods, generative SE approaches often rely solely on an unsupervised loss, which may make less efficient use of the noisy speech they are conditioned on. To address this issue, we propose augmenting the original diffusion training objective with an ℓ2 loss that measures the discrepancy between the ground-truth clean speech and its estimate at each diffusion time-step. Experimental results demonstrate the effectiveness of the proposed methodology.
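As a rough illustration of the augmented objective described above, the following minimal sketch adds a weighted ℓ2 term to a generic diffusion loss. It is not the authors' implementation: the function names, the noise-prediction form of the diffusion term, and the `l2_weight` parameter are all assumptions made for illustration, and the abstract does not specify how the clean-speech estimate is obtained at each time-step, so it is simply passed in.

```python
from typing import Sequence


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean of squared element-wise differences between two equal-length signals."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def augmented_diffusion_loss(
    noise_pred: Sequence[float],
    noise: Sequence[float],
    clean_est: Sequence[float],
    clean: Sequence[float],
    l2_weight: float = 1.0,
) -> float:
    """Hypothetical combined loss for one diffusion time-step.

    diffusion_term: the unsupervised diffusion objective, written here
        (as an assumption) as the error between predicted and sampled noise.
    l2_term: the supervised term, comparing the clean-speech estimate at
        this time-step with the ground-truth clean speech.
    l2_weight: assumed trade-off weight; not given in the source.
    """
    diffusion_term = mean_squared_error(noise_pred, noise)
    l2_term = mean_squared_error(clean_est, clean)
    return diffusion_term + l2_weight * l2_term
```

Setting `l2_weight` to zero recovers the purely unsupervised diffusion term in this sketch, which makes it easy to compare the two training objectives under otherwise identical settings.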
Ayilo et al. (Mon,) studied this question.