Advances in text-to-speech (TTS) synthesis have focused primarily on naturalness and intelligibility, but integrating nuanced emotional expressiveness and speaker variability remains a challenge, especially in dynamic settings such as customer service and assistive speech technologies. This paper adopts direct text input in place of the phoneme-first input used by conventional methods such as FastSpeech, improving the user experience. We integrate the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), along with pitch, energy, and duration, into the variance adaptor of the FastSpeech 2 model to deepen the emotional expressiveness of the synthesized speech. We propose a Multi-speaker Emotional Text-to-speech Synthesis System (METTS) that lets users enter the desired text, select from various speaker voices, and choose emotional tones ranging from happiness to sadness, surprise, neutrality, and anger. Unique to METTS is the ability to integrate personal voice datasets, which makes the system highly customizable. We assess speech quality and naturalness with the NISQA model, obtaining a mean opinion score (MOS) of 3.72 ± 0.78 for multi-speaker evaluation and 4.09 ± 0.65 for individual speaker voices. The paper details the METTS architecture, the enhancements to FastSpeech 2, and the methods for embedding emotional and speaker variation.
Kolekar et al. studied this question.
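The variance-adaptor conditioning described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not say how eGeMAPS features enter the adaptor, so the sketch assumes they are linearly projected and added to each phoneme's hidden state alongside pitch and energy, after which a FastSpeech-style length regulator repeats each state for its duration in frames. The function names, the weight arguments (`w_pitch`, `w_energy`, `w_egemaps`), and the toy dimensions are all hypothetical.

```python
from typing import List

Vector = List[float]


def scale(vec: Vector, s: float) -> Vector:
    """Multiply every component of vec by the scalar s."""
    return [s * x for x in vec]


def project(weights: List[Vector], features: Vector) -> Vector:
    """Linear projection; weights holds one row per output dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def add(*vectors: Vector) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def variance_adaptor(
    hidden: List[Vector],     # one hidden state per input token
    durations: List[int],     # frames per token
    pitch: List[float],       # one pitch value per token
    energy: List[float],      # one energy value per token
    egemaps: List[Vector],    # one eGeMAPS feature vector per token
    w_pitch: Vector,          # hypothetical embedding weights
    w_energy: Vector,
    w_egemaps: List[Vector],
) -> List[Vector]:
    """Assumed scheme: add projected variance features, then expand by duration."""
    frames: List[Vector] = []
    for h, d, p, e, g in zip(hidden, durations, pitch, energy, egemaps):
        conditioned = add(
            h, scale(w_pitch, p), scale(w_energy, e), project(w_egemaps, g)
        )
        # Length regulator: repeat the conditioned state once per frame.
        frames.extend(list(conditioned) for _ in range(d))
    return frames


# Toy example: two tokens, hidden size 2, three eGeMAPS features
# (far fewer than the real feature set).
frames = variance_adaptor(
    hidden=[[1.0, 0.0], [0.0, 1.0]],
    durations=[2, 1],
    pitch=[0.5, -0.5],
    energy=[1.0, 2.0],
    egemaps=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    w_pitch=[1.0, 0.0],
    w_energy=[0.0, 1.0],
    w_egemaps=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
)
print(len(frames), frames)
```

The output has one vector per frame, so its length equals the sum of the durations. A trained model would predict the durations, pitch, and energy and would learn the projection weights; here they are fixed by hand purely for illustration.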