August 17, 2025 · Open Access

Benchmarking the Responsiveness of Open-Source Text-to-Speech Systems

Key Points

Some open-source TTS models show sub-second latency, suggesting potential for interactive applications.
Benchmarking reveals performance variability, with trade-offs between speed and audio quality observed among models.
A standardized evaluation inspired by MLPerf measures TTS model responsiveness under controlled conditions.
The framework sets a reproducible baseline for comparing TTS systems in latency-sensitive environments.

Abstract

This study addresses a significant gap in voice assistant research by evaluating the responsiveness of open-source text-to-speech (TTS) models, an often overlooked yet critical component of real-time applications. Responsiveness here means the speed at which a TTS system generates speech in reaction to input, which is crucial for maintaining natural, real-time interactions. While extensive benchmarking has been performed on speech-to-text and large language models, little work has focused on how efficiently TTS systems respond in live settings. One reason is that TTS research has historically prioritized subjective quality metrics such as naturalness and intelligibility, which are easier to assess through human listening tests than real-time performance; another is the lack of standardized, reproducible tools for measuring latency and responsiveness. This work presents the first comprehensive benchmark focused on responsiveness, assessing latency, tail latency, and real-time processing performance across 13 prominent, readily available open-source models, in contrast to commercial systems such as Amazon Polly or Google Cloud TTS, which are closed-source and paywalled. Using a standardized single-stream evaluation inspired by MLPerf Inference, the study measures model responsiveness under controlled conditions and also investigates trade-offs between speed and audio quality. Results reveal substantial variability across models: some achieve sub-second latency suitable for interactive systems, while others fall short of real-time standards. The benchmark highlights performance bottlenecks in autoregressive architectures and identifies parallel and flow-based models as more efficient for low-latency scenarios. Importantly, the proposed framework provides a reproducible foundation for comparing TTS models in latency-sensitive environments and sets a baseline for future research.
By focusing on responsiveness, this work contributes to the development of more effective and natural voice interfaces.
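
As a rough illustration of the kind of measurement the abstract describes, the sketch below times a TTS callable one request at a time (a single-stream pattern) and reports mean latency, tail latency, and real-time factor. It is not the authors' harness: the `synthesize` signature, the `dummy_tts` stand-in, the nearest-rank percentile method, and the choice of the 90th and 99th percentiles are all assumptions made for this example.

```python
import math
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical interface (an assumption): text in, (samples, sample_rate) out.
Synthesizer = Callable[[str], Tuple[List[float], int]]


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def single_stream_benchmark(
    synthesize: Synthesizer, prompts: List[str]
) -> Dict[str, float]:
    """Issue prompts one at a time and time each synthesis call."""
    latencies: List[float] = []
    audio_seconds: List[float] = []
    for text in prompts:
        start = time.perf_counter()
        samples, sample_rate = synthesize(text)
        latencies.append(time.perf_counter() - start)
        audio_seconds.append(len(samples) / sample_rate)
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "p90_latency_s": percentile(latencies, 90),
        "p99_latency_s": percentile(latencies, 99),
        # Real-time factor: processing time divided by audio duration.
        # Values below 1 mean synthesis runs faster than playback.
        "rtf": sum(latencies) / sum(audio_seconds),
    }


def dummy_tts(text: str) -> Tuple[List[float], int]:
    """Stand-in for a real model: 0.1 s of silence per character at 16 kHz."""
    sample_rate = 16000
    return [0.0] * (len(text) * sample_rate // 10), sample_rate


if __name__ == "__main__":
    prompts = ["Hello there.", "How can I help you today?"]
    print(single_stream_benchmark(dummy_tts, prompts))
```

To try this with a real model, one would replace `dummy_tts` with a wrapper around that model's inference call. A fuller benchmark would also add warm-up runs and a fixed prompt set.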

Cite This Study

Dinh et al. (2025). Benchmarking the Responsiveness of Open-Source Text-to-Speech Systems.

synapsesocial.com/papers/68a36c270a429f797332fe9b https://doi.org/10.20944/preprints202508.0654.v1
