Abstract

Automatic speech recognition (ASR) provides a scalable way to annotate large speech datasets, yet its inherent transcription errors significantly complicate text-to-speech (TTS) training. While modern diffusion- and flow-based architectures achieve high-quality generation, their performance under noisy transcription conditions remains underexplored. This study investigates the robustness of flow-based (VITS) and diffusion-based (GradTTS, DVT) models trained on simulated noisy transcriptions, using the autoregressive Tacotron 2 as a baseline. The authors train models on datasets with varying noise levels and evaluate the resulting speech quality using both objective intelligibility metrics and subjective naturalness ratings. Experimental results show that diffusion-based models are the most robust, maintaining high intelligibility and naturalness even under high-noise conditions. Furthermore, a behavioral analysis reveals a latent domain separation phenomenon: models trained on noisy data spontaneously organize text representations by transcription quality, despite the absence of explicit labels during training. The authors find that this separation correlates with the resulting text features and with degraded synthesis performance. To mitigate this degradation, the authors investigate a text prompt strategy that prepends reliably synthesizable text fragments to guide the model toward activating higher-quality representations. This lightweight approach improves synthesis stability without requiring model fine-tuning.
Feng et al. studied this question.