In the text-to-speech (TTS) task, latent diffusion models offer excellent fidelity and generalization, but their high resource consumption and slow inference speed have long been a challenge. To address this issue, this paper proposes the Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation (DCTTS). Specifically, we employ a straightforward and effective text encoder, compress the raw data into a discrete space using a VQ model, and then train the diffusion model on that discrete space. To minimize the number of diffusion steps needed to synthesize high-quality speech, we apply a contrastive learning loss throughout the diffusion model training phase. Experimental results demonstrate that the proposed approach achieves outstanding speech synthesis quality and sampling speed while significantly reducing the resource consumption of the diffusion model. The synthesized samples are available at https://github.com/lawtherWu/DCTTS
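The two building blocks named in the abstract, vector quantization and a contrastive loss, can be illustrated with a minimal sketch. This is not the DCTTS implementation: the abstract does not specify the VQ model or the exact form of the loss, so the nearest-neighbour codebook lookup and the generic InfoNCE-style loss below are assumptions, and the names `quantize`, `info_nce`, and `temperature` are hypothetical.

```python
import math


def quantize(frames, codebook):
    """Map each continuous frame to the index of its nearest codebook vector.

    Assumption: plain nearest-neighbour lookup under squared Euclidean
    distance, the usual core of a VQ model; the paper's VQ model may differ.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style contrastive loss on cosine similarities.

    Assumption: this is a textbook form, used here only to show the idea of
    pulling the anchor toward a positive and away from negatives.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    # The positive pair is placed at index 0, followed by the negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp, then negative log-softmax of the positive.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


# Toy data: three 2-D "frames" and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 4.1]]
tokens = quantize(frames, codebook)

# The loss is small when the positive matches the anchor and large otherwise.
loss_matched = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_mismatched = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this sketch the discrete token indices stand in for the discrete space on which a diffusion model would be trained, and the contrastive term is the kind of auxiliary loss that could be added to the training objective.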
Wu et al. studied this question.