What question did this study set out to answer?

The aim is to explore how well flow-based and diffusion-based speech models perform when trained on inaccurate transcriptions.

May 10, 2026 | Open Access

An investigation of the robustness of flow- and diffusion-based speech generation models on noisy transcriptions

Key Points

Models trained on datasets with various noise levels
Evaluation using intelligibility metrics and naturalness ratings
Behavioral analysis of text representations based on transcription quality
Diffusion-based models maintain high intelligibility under high noise levels (p<0.05).
The latent domain separation phenomenon correlates with text features and synthesis quality.
Text prompt strategy improves synthesis stability without fine-tuning the models.

Abstract

Automatic speech recognition (ASR) provides a scalable solution for annotating large speech datasets, yet inherent transcription errors significantly complicate text-to-speech (TTS) training. While modern diffusion- and flow-based architectures achieve high-quality generation, their performance under noisy transcription conditions remains underexplored. This study investigates the robustness of flow-based (VITS) and diffusion-based (GradTTS, DVT) models trained on simulated noisy transcriptions, using the autoregressive Tacotron 2 as a baseline. The authors train models on datasets with varying noise levels and evaluate the resulting speech quality using both objective intelligibility metrics and subjective naturalness ratings. Experimental results demonstrate that diffusion-based models exhibit superior robustness, maintaining high intelligibility and naturalness even under high-noise conditions. Furthermore, the behavioral analysis reveals a latent domain separation phenomenon: models trained on noisy data spontaneously organize text representations based on transcription quality, despite the absence of explicit labels during training. The authors find that this separation correlates with the resulting text features and degraded synthesis performance. To mitigate this degradation, the authors investigate a text prompt strategy that prepends reliably synthesizable text fragments to guide the model toward activating higher-quality representations. This lightweight approach improves synthesis stability without requiring model fine-tuning.
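The abstract does not say how the noisy transcriptions are simulated or which intelligibility metric is used. As a rough illustration only, the sketch below assumes word-level corruption (random substitution, deletion, or insertion at a chosen noise level) and word error rate (WER) as the objective metric. The function names `corrupt_transcript` and `word_error_rate`, the `vocab` list, and the three corruption operations are all hypothetical choices made for this example, not the authors' procedure.

```python
import random


def corrupt_transcript(text, noise_level, vocab, seed=0):
    """Corrupt each word with probability `noise_level` (assumed noise model).

    A corrupted word is substituted, deleted, or followed by an inserted
    word, chosen uniformly at random. `vocab` supplies replacement words.
    """
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < noise_level:
            op = rng.choice(("substitute", "delete", "insert"))
            if op == "substitute":
                out.append(rng.choice(vocab))
            elif op == "insert":
                out.extend([word, rng.choice(vocab)])
            # "delete": the word is simply dropped
        else:
            out.append(word)
    return " ".join(out)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


if __name__ == "__main__":
    clean = "the quick brown fox jumps over the lazy dog"
    vocab = ["cat", "runs", "blue", "under", "tree"]
    for level in (0.0, 0.2, 0.5):
        noisy = corrupt_transcript(clean, level, vocab, seed=1)
        print(level, round(word_error_rate(clean, noisy), 2), noisy)
```

In a setup like this, the same WER function could serve two roles: checking that a simulated training set actually reaches the intended noise level, and scoring synthesized speech after it has been transcribed by an ASR system. Whether the study measures intelligibility this way is an assumption.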
