What question did this study set out to answer?

The aim is to enhance zero-shot style transfer in text-to-speech systems using smaller models without compromising quality.

March 13, 2026 | Open Access

Improving zero-shot style transfer text-to-speech by disentangled fine-grained style modeling

Key Points

Aimed to enhance zero-shot style transfer in text-to-speech systems using a smaller model without compromising quality.
Proposed a zero-shot method that combines the small GenerSpeech backbone with a fine-grained style encoder.
Implemented a mutual-information minimization loss to disentangle speaker, global and fine-grained style, and content embeddings.
Applied a maximum-mean-discrepancy-guided cycle consistency loss for better style embedding diversity.
Achieved a relative average style preference improvement of 31% over baseline methods.
Obtained a prosody similarity mean opinion score of 3.64 on the VCTK dataset.

Abstract

Recent zero-shot style-transfer speech synthesis methods have shown promising results in adapting to unseen speaking styles. While most state-of-the-art methods generalize to new speakers and styles by relying on large models or corpora, achieving similar generalization with a smaller model remains an open challenge. We propose a zero-shot method that uses the small GenerSpeech backbone plus a fine-grained style encoder. To disentangle speaker, global and fine-grained style, and content embeddings, we introduce a mutual-information minimization loss. To further disentangle style from speaker and boost style embedding diversity, we introduce a maximum-mean-discrepancy-guided cycle consistency loss. Experimental results show that the proposed method outperforms baseline zero-shot style-transfer methods (GenerSpeech, YourTTS, VALL-E-X), with a relative average style preference improvement of 31% and a prosody similarity mean opinion score of 3.64 on VCTK.
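For readers unfamiliar with maximum mean discrepancy (MMD), the following is a minimal sketch of a standard biased MMD estimate with a Gaussian kernel, which measures how far apart two sets of embeddings are in distribution. It is not the paper's loss: the abstract does not specify the kernel, the estimator, or how MMD guides the cycle consistency term, so the function names (`rbf_kernel`, `mmd_squared`), the `bandwidth` parameter, and the toy 4-dimensional "embeddings" are all assumptions made for illustration.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples.

    xs and ys are lists of vectors. The estimate is
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    """
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 4-dimensional "style embeddings"; not data from the paper.
    same_a = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
    same_b = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
    shifted = [[rng.gauss(3.0, 1.0) for _ in range(4)] for _ in range(50)]

    # Two samples from the same distribution should give a small value,
    # while a shifted distribution should give a clearly larger one.
    print("same distribution:   ", mmd_squared(same_a, same_b))
    print("shifted distribution:", mmd_squared(same_a, shifted))
```

In a training setting, a quantity like this would typically be computed on batches of embeddings with a differentiable tensor library rather than plain Python lists. Whether the loss minimizes or maximizes the discrepancy depends on which pair of distributions is compared, a detail the abstract leaves open.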
