February 19, 2024

Advancing AI Voice Synthesis: Integrating Emotional Expression in Multi-Speaker Voice Generation

Abstract

Advances in text-to-speech (TTS) synthesis have focused primarily on naturalness and intelligibility, but integrating nuanced emotional expressiveness and speaker variability remains a challenge, especially in dynamic settings such as customer service and assistive speech technologies. This paper adopts direct text input in place of the conventional phoneme-first approach used by models such as FastSpeech, improving the user experience. We integrate the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS), along with pitch, energy, and duration, into the variance adaptor of the FastSpeech 2 model to deepen the emotional expressiveness of the generated speech. We propose a Multi-speaker Emotional Text-to-speech Synthesis System (METTS) that lets users enter the desired text, select from various speaker voices, and choose an emotional tone: happiness, sadness, surprise, neutrality, or anger. Unique to METTS is the ability to integrate personal voice datasets, which makes the system highly customizable. We assess speech quality and naturalness with the NISQA model, obtaining a mean opinion score (MOS) of 3.72 ± 0.78 for multi-speaker evaluation and 4.09 ± 0.65 for individual speaker voices. The paper details METTS's architecture, the enhancements to FastSpeech 2, and the methods for embedding emotional and speaker variations.
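The reported figures have the form "mean ± spread" over predicted MOS values. As a minimal sketch of how such a summary could be computed, the snippet below aggregates per-utterance scores into that format. The function names and the example scores are hypothetical, and the use of the sample standard deviation is an assumption; the abstract does not state how the spread was calculated.

```python
from statistics import mean, stdev


def summarize_mos(scores):
    """Return (mean, sample standard deviation) of per-utterance MOS values.

    Assumption: the spread is a sample standard deviation. The abstract
    does not say which estimator the authors used.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a spread")
    return mean(scores), stdev(scores)


def format_mos(scores):
    """Format scores as 'mean ± std' with two decimal places."""
    m, s = summarize_mos(scores)
    return f"{m:.2f} ± {s:.2f}"


# Hypothetical per-utterance scores for illustration only (not from the paper).
example_scores = [3.0, 4.0, 5.0]
print(format_mos(example_scores))  # prints "4.00 ± 1.00"
```

In practice the input list would hold one predicted score per synthesized utterance, grouped either across all speakers or per individual speaker to match the two figures quoted above.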

Cite This Study

Kolekar et al. (2024) studied this question.

synapsesocial.com/papers/68e78968b6db6435876fbdd3 https://doi.org/10.1109/icaiic60209.2024.10463204