September 1, 2024

TSP-TTS: Text-based Style Predictor with Residual Vector Quantization for Expressive Text-to-Speech

Abstract

Expressive text-to-speech (TTS) aims to synthesize more human-like speech by incorporating diverse speech styles or emotions. While most expressive TTS models rely on reference speech to condition the style of the generated speech, they often fail to generate speech of consistent quality. To ensure consistent speech quality, we propose an expressive TTS model conditioned on a style representation extracted from the text itself. To implement this text-based style predictor, we design a style module incorporating residual vector quantization. Furthermore, the style representation is enhanced through style-to-text alignment and a mel decoder with style hierarchical layer normalization (SHLN). Our experimental results show that the proposed model accurately estimates the style representation, enabling the generation of high-quality speech without the need for reference speech.
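
The abstract names residual vector quantization as part of the style module but does not describe its internals here. As a rough illustration of the general technique only, the sketch below quantizes a vector in stages, where each stage picks the nearest entry from its own codebook and passes the leftover residual to the next stage. The function names, dimensions, and randomly generated codebooks are assumptions made for this example; in a trained model the codebooks would be learned, and the paper's actual style module may differ.

```python
import random


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_vector_quantize(vector, codebooks):
    """Quantize `vector` with a stack of codebooks (generic RVQ, illustrative only).

    Each stage quantizes the residual left over by the previous stages.
    Returns the summed quantized vector and the chosen index per stage.
    """
    residual = list(vector)
    quantized = [0.0] * len(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return quantized, indices


# Hypothetical sizes and random codebooks, chosen only to make the sketch runnable.
random.seed(0)
DIM, CODEBOOK_SIZE, NUM_STAGES = 4, 8, 3
codebooks = [
    [[random.gauss(0.0, 0.5 ** stage) for _ in range(DIM)]
     for _ in range(CODEBOOK_SIZE)]
    for stage in range(NUM_STAGES)
]
style_vector = [random.gauss(0.0, 1.0) for _ in range(DIM)]
quantized, indices = residual_vector_quantize(style_vector, codebooks)
print(indices)
```

The list of indices, one per stage, is the discrete code for the input vector; summing the selected codebook entries reconstructs its quantized approximation.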

Cite This Study

Seong et al. (2024) studied this question.

synapsesocial.com/papers/68e59e8eb6db643587538a36 https://doi.org/10.21437/interspeech.2024-1734
