Previous emotional TTS models based on two-stage pipelines require costly labeling to produce natural speech and involve a complicated training process. In addition, the pitch of the voice often does not match the emotion label assigned to it. To address these problems, this study presents TEA-VITS, an end-to-end TTS model that uses Speech Emotion Diarization (SED) to generate words with a clear emotional accent. TEA-VITS uses SED to recognize how emotion changes over time within a single utterance and to label those temporal changes in the voice data used for model training. The results show that the synthesized speech conveys emotion more clearly than that of previous emotional TTS models, and they suggest new directions for efficient and natural synthesis of emotional expression.
Park et al. studied this question.
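The temporal labeling step mentioned in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: it assumes that an SED system returns emotion segments as (start, end, label) tuples in seconds and that the TTS model consumes one emotion label per fixed-length frame. The function name `segments_to_frame_labels`, the 10 ms default hop, and the `"neutral"` fallback are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical SED output format: (start_sec, end_sec, emotion_label).
Segment = Tuple[float, float, str]


def segments_to_frame_labels(
    segments: List[Segment],
    duration: float,
    hop: float = 0.01,         # assumed frame hop in seconds
    default: str = "neutral",  # assumed label for frames no segment covers
) -> List[str]:
    """Expand time-stamped emotion segments into one label per frame."""
    n_frames = int(round(duration / hop))
    labels = [default] * n_frames
    for start, end, emotion in segments:
        # Clamp each segment to the valid frame range.
        first = max(0, int(round(start / hop)))
        last = min(n_frames, int(round(end / hop)))
        for i in range(first, last):
            labels[i] = emotion
    return labels


if __name__ == "__main__":
    # One utterance whose emotion changes halfway through.
    sed_output = [(0.0, 0.5, "neutral"), (0.5, 1.0, "angry")]
    print(segments_to_frame_labels(sed_output, duration=1.0, hop=0.1))
```

Frame-level labels of this kind are one plausible way to expose emotion changes within an utterance to a model during training; the paper may align labels at a different granularity, such as per word or per phoneme.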