What question did this study set out to answer?

This research aims to improve the naturalness and expressiveness of text-to-speech output by addressing prosodic diversity.

December 8, 2025 | Open Access

Using Denoising Diffusion Model for Predicting Global Style Tokens in an Expressive Text-to-Speech System

Key Points

The authors propose a technique that uses a diffusion framework to predict speaking-style features at inference time.
Embeddings based on Global Style Tokens are used to condition a neural TTS model.
The system's performance is evaluated both quantitatively and qualitatively (human-centered experiments).
The system generates expressive, human-like speech with diverse prosodic features.
Non-deterministic sampling makes the generated speech sound more natural.

Abstract

Text-to-speech (TTS) systems based on neural networks have undergone a significant evolution, taking a step forward towards achieving human-like quality and expressiveness, which is crucial for applications such as social media content creation and voice interfaces for visually impaired individuals. An entire branch of research, known as Expressive Text-to-Speech (ETTS), has emerged to address the so-called one-to-many mapping problem, which limits the naturalness of generated output. However, most ETTS systems applying explicit style modeling treat the prediction of prosodic features as a regressive, rather than generative, process and, consequently, do not capture prosodic diversity. We address this problem by proposing a novel technique for inference-time prediction of speaking-style features. The technique leverages a diffusion framework for sampling from a learned space of Global Style Tokens-based embeddings, which are then used to condition a neural TTS model. By incorporating the diffusion model, we can leverage its powerful modeling capabilities to learn the distribution of possible stylistic features and, during inference, sample them non-deterministically, which makes the generated speech more human-like by alleviating prosodic monotony across multiple sentences. Our system blends a regressive predictor with a diffusion-based generator to enable smooth control over the diversity of generated speech. Through quantitative and qualitative (human-centered) experiments, we demonstrate that our system generates expressive, human-like speech with non-deterministic high-level prosodic features.
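
The idea of sampling a style embedding with a diffusion model and blending it with a regressive prediction can be illustrated with a toy sketch. The code below is not the authors' implementation: the abstract does not specify the network, the noise schedule, or how the two predictors are blended. It assumes a standard DDPM-style reverse process, replaces the trained denoiser with a closed-form stand-in for a Gaussian embedding distribution, and assumes linear interpolation as the blending rule. All function names and parameters (`toy_noise_predictor`, `sample_style_embedding`, `blend_styles`, `diversity`) are hypothetical.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.2):
    """Linear beta schedule and the cumulative products of alpha = 1 - beta."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_noise_predictor(x_t, alpha_bar, mean, std):
    """Hypothetical stand-in for a trained denoiser.

    Returns the closed-form E[eps | x_t] under the ASSUMPTION that clean
    style embeddings follow N(mean, std^2) independently per dimension.
    """
    var_t = alpha_bar * std ** 2 + (1.0 - alpha_bar)
    scale = math.sqrt(1.0 - alpha_bar) / var_t
    return [scale * (x - math.sqrt(alpha_bar) * m) for x, m in zip(x_t, mean)]


def sample_style_embedding(mean, std, rng, num_steps=50):
    """DDPM-style ancestral sampling: start from noise, denoise step by step."""
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in mean]
    for t in reversed(range(num_steps)):
        eps_hat = toy_noise_predictor(x, alpha_bars[t], mean, std)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv_sqrt_alpha = 1.0 / math.sqrt(1.0 - betas[t])
        x = [inv_sqrt_alpha * (xi - coef * ei) for xi, ei in zip(x, eps_hat)]
        if t > 0:  # no noise is added on the final step
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def blend_styles(regressive, sampled, diversity):
    """ASSUMED blending rule: linear interpolation, diversity in [0, 1]."""
    return [(1.0 - diversity) * r + diversity * s
            for r, s in zip(regressive, sampled)]


# Usage: the regressive prediction is a fixed vector; the sampled one varies.
regressive_style = [0.5, -0.2, 0.1]
sampled_style = sample_style_embedding(regressive_style, 0.3, random.Random(7))
conditioning = blend_styles(regressive_style, sampled_style, diversity=0.5)
```

With `diversity=0.0` the conditioning vector reduces to the deterministic regressive prediction, and with `diversity=1.0` it is the diffusion sample alone, so different random seeds yield different styles for the same input.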
