End-to-end single-stage text-to-speech (TTS) models have attracted significant attention in recent research and surpass the performance of conventional two-stage pipeline systems. While prior single-stage models have made substantial advances, they still suffer from intermittent unnaturalness and limited prosody diversity. Unlike previous works, we propose a novel single-stage TTS framework that tackles these problems via hierarchical denoising diffusion generative adversarial network (GAN) modeling, which parameterizes the denoising model by directly predicting latent variables to improve the naturalness and diversity of the generated speech. Specifically, a conditional GAN is adopted as a non-Gaussian multimodal function to model the denoising distribution, and it is used to construct the duration predictor and the speech decoder, respectively. As such, it allows the TTS model to learn the more natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. In addition, we show that the proposed model, DETS, can generate high-fidelity speech waveforms with only one denoising step. Extensive experimental results on the LJSpeech benchmark dataset demonstrate the favourable performance of the proposed method.
Wang et al. studied this question.