Expressive text-to-speech (TTS) aims to synthesize more human-like speech by incorporating diverse speech styles or emotions. Most expressive TTS models rely on reference speech to condition the style of the generated speech, but they often fail to generate speech of consistent quality. To ensure consistent speech quality, we propose an expressive TTS model conditioned on a style representation extracted from the text itself. To implement this text-based style predictor, we design a style module incorporating residual vector quantization. Furthermore, the style representation is enhanced through style-to-text alignment and a mel decoder with style hierarchical layer normalization (SHLN). Our experimental findings demonstrate that the proposed model accurately estimates the style representation, enabling the generation of high-quality speech without the need for reference speech.
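For readers unfamiliar with residual vector quantization, the general technique can be sketched as follows. This is a minimal, generic illustration and not the authors' style module: the function names, the use of squared Euclidean distance, and the fixed (untrained) codebooks are all assumptions made for the example. Each stage picks the codeword nearest to what the previous stages left unexplained, and the final approximation is the sum of the chosen codewords.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Codebook = Sequence[Sequence[float]]


def nearest_index(vec: Sequence[float], codebook: Codebook) -> int:
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((v - c) ** 2 for v, c in zip(vec, code))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_vector_quantize(
    vec: Sequence[float], codebooks: Sequence[Codebook]
) -> Tuple[List[int], Vector]:
    """Quantize vec with a stack of codebooks (hypothetical helper, generic RVQ).

    Returns the chosen index per stage and the summed approximation.
    """
    residual = list(vec)            # part of the input not yet explained
    quantized = [0.0] * len(vec)    # running sum of selected codewords
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, quantized
```

In a trained model the codebooks would be learned jointly with the rest of the network; here they are simply passed in, so the sketch shows only the coarse-to-fine selection step.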
Seong et al. studied this question.
www.synapsesocial.com/papers/68e59e8eb6db643587538a36 — DOI: https://doi.org/10.21437/interspeech.2024-1734
Donghyun Seong
Ho‐Young Lee
Joon‐Hyuk Chang