February 15, 2024

TEA-VITS: emotion voice synthesis based on temporal emotion analysis

Abstract

Previous emotional text-to-speech (TTS) models based on two-stage pipelines require costly labeling to produce natural speech and have a complicated training process. In addition, the pitch of the voice often does not match the emotion label assigned to it. To address these problems, this study presents TEA-VITS, an end-to-end TTS model that uses Speech-Emotion-Diarization (SED) to generate words with a clear emotional accent. TEA-VITS uses SED to recognize how emotion changes over time within a single voice sample and labels these temporal emotion changes in the voice data used for model training. The results show that the synthesized speech conveys emotion more clearly than that of previous emotional TTS models, and the approach suggests a new direction for efficient and natural emotional speech synthesis.

Cite This Study

Park et al. studied this question.

synapsesocial.com/papers/68e79098b6db643587702411 https://doi.org/10.1049/icp.2024.0235
