Recent zero-shot style-transfer speech synthesis methods have shown promising results in adapting to unseen speaking styles. While most state-of-the-art methods generalize to new speakers and styles by relying on large models or corpora, achieving similar generalization with a smaller model remains an open challenge. We propose a zero-shot method that combines the small GenerSpeech backbone with a fine-grained style encoder. To disentangle the speaker, global and fine-grained style, and content embeddings, we introduce a mutual-information minimization loss. To further disentangle style from speaker and increase the diversity of style embeddings, we introduce a maximum-mean-discrepancy-guided cycle consistency loss. Experimental results show that the proposed method outperforms baseline zero-shot style-transfer methods (GenerSpeech, YourTTS, VALL-E-X), with a relative average style preference improvement of 31% and a prosody similarity mean opinion score of 3.64 on VCTK.
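As background on the maximum mean discrepancy (MMD) named in the abstract, the following is a minimal sketch of a generic squared-MMD estimate between two batches of embeddings. It is not the authors' loss: the abstract does not specify the kernel, the estimator, or how MMD guides the cycle consistency term. The Gaussian (RBF) kernel, the biased estimator, and the names `rbf_kernel`, `mmd2`, and `bandwidth` are assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d).

    Assumption: an RBF kernel with a single fixed bandwidth; the paper's
    kernel choice is not stated in the abstract.
    """
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding
    sq_dist = np.maximum(sq_dist, 0.0)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets x and y.

    Small values mean the two sets of embeddings look alike under the
    kernel; larger values mean their distributions differ.
    """
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    same_a = rng.normal(size=(64, 8))            # hypothetical embeddings
    same_b = rng.normal(size=(64, 8))            # drawn from the same distribution
    shifted = rng.normal(loc=2.0, size=(64, 8))  # drawn from a shifted distribution
    print("same distribution:   ", mmd2(same_a, same_b, bandwidth=2.0))
    print("shifted distribution:", mmd2(same_a, shifted, bandwidth=2.0))
```

In this sketch, two batches drawn from the same distribution give a smaller value than a batch compared against a shifted one, which is the property that makes MMD usable as a training signal on embedding distributions.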
Eray Eren
Qingju Liu
A. Alwan
JASA Express Letters
SHILAP Revista de lepidopterología
University of California, Los Angeles
IntraMedical Imaging (United States)
Eren et al. studied this question.
www.synapsesocial.com/papers/69b3acd302a1e69014ccecec — DOI: https://doi.org/10.1121/10.0042974