Recent zero-shot style-transfer speech synthesis methods have shown promising results in adapting to unseen speaking styles. While most state-of-the-art methods generalize to new speakers and styles by relying on large models or large corpora, achieving similar generalization with a smaller model remains an open challenge. We propose a zero-shot method that combines the small GenerSpeech backbone with a fine-grained style encoder. To disentangle the speaker, global style, fine-grained style, and content embeddings, we introduce a mutual-information minimization loss. To further disentangle style from speaker and increase the diversity of the style embeddings, we introduce a maximum-mean-discrepancy-guided cycle consistency loss. Experimental results on VCTK show that the proposed method outperforms baseline zero-shot style-transfer methods (GenerSpeech, YourTTS, VALL-E-X), with a relative average style preference improvement of 31% and a prosody similarity mean opinion score of 3.64.
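For readers unfamiliar with the maximum mean discrepancy (MMD) named above, the following is a minimal sketch of the generic squared-MMD statistic between two batches of embeddings. It is not the loss used in this work: the Gaussian (RBF) kernel, the `bandwidth` parameter, the biased estimator, and the function names `rbf_kernel` and `mmd2` are all illustrative assumptions, and the abstract does not specify how MMD guides the cycle consistency term.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between rows of x (n, d) and rows of y (m, d)."""
    # Pairwise squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sq_dist = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clamp tiny negative values caused by floating-point round-off
    sq_dist = np.maximum(sq_dist, 0.0)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples x and y.

    Assumption: an RBF kernel with a single fixed bandwidth; the kernel
    choice here is illustrative only.
    """
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for two batches of 8-dimensional embeddings
    emb_a = rng.normal(size=(64, 8))
    emb_b = rng.normal(loc=3.0, size=(64, 8))
    print("same batch:   ", mmd2(emb_a, emb_a, bandwidth=4.0))
    print("shifted batch:", mmd2(emb_a, emb_b, bandwidth=4.0))
```

The statistic is close to zero when the two batches come from the same distribution and grows as the distributions move apart, which is why it can serve as a signal for pushing embedding distributions together or apart.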
Eren et al. (Sun,) studied this question.