What question did this study set out to answer?

To investigate the factors influencing the naturalness and similarity of AI voice-cloned speech across different systems.

June 14, 2026 · Open Access

What determines the success of AI voice-cloned speech? Prosodic and acoustic evidence on three TTS systems

Key Points

Comprehensive prosodic and computational analysis of three TTS systems (ElevenLabs, StyleTTS-V2, XTTS-V2).
Evaluation of prosodic features: pitch, amplitude, speech rate, rhythm, and intonation.
Assessment of speaker-embedding similarity, building on previously reported listener judgments of the same stimuli.
ElevenLabs showed the closest correspondence to human speech across several prosodic and speaker-identity measures, though not uniformly across all dimensions.
The clearest differences between systems appeared in speech rate, vowel-based rhythm, local pitch control, and speaker-embedding similarity.
Prosodic timing, rhythm, and fine-grained pitch control are suggested as potential correlates of perceived naturalness.

Abstract

Artificial intelligence (AI) voice cloning systems have advanced rapidly, enabling applications in education and assistive technologies. Yet listeners’ perceptual ratings of naturalness and similarity remain inconsistent: some systems approach human-like quality, while others sound noticeably artificial. Here we present a comprehensive prosodic and computational analysis of voice-cloned speech across three voice-cloning systems (ElevenLabs, StyleTTS-V2, XTTS-V2), building on the listener judgments of these stimuli reported in Bakkouche et al. (Finding the human voice in AI: Insights on the perception of AI-voice clones from naturalness and similarity ratings. In Proceedings of Interspeech 2025, 2190–2194. Rotterdam, The Netherlands: ISCA. https://www.isca-archive.org/interspeech_2025/bakkouche25_interspeech.pdf (accessed 14 October 2025)). We analysed pitch, amplitude, speech rate, rhythm, intonation, and speaker-embedding similarity. Overall, ElevenLabs showed the closest correspondence to human speech across several prosodic and speaker-identity measures, although system differences were not uniform across all dimensions. The clearest acoustic differences between models were observed in speech rate, vowel-based rhythm measures, local pitch-control measures, and speaker-embedding similarity. These acoustic findings are consistent with listeners’ perceptual judgments of naturalness and suggest that prosodic timing, rhythm, and fine-grained pitch control are potential correlates of perceived naturalness, and that improvement of these features can contribute to the development of more natural-sounding synthesised speech.

Cite This Study

Bakkouche et al. studied this question.

synapsesocial.com/papers/6a2e4632b1cc60ccdea8af7a https://doi.org/10.1515/phon-2025-0062
