YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieve state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training.
Casanova et al. studied this question.