Abstract

Artificial intelligence (AI) voice cloning systems have advanced rapidly, enabling applications in education and assistive technologies. Yet listeners’ perceptual ratings of naturalness and similarity remain inconsistent: some systems approach human-like quality, while others sound noticeably artificial. Here we present a comprehensive prosodic and computational analysis of voice-cloned speech across three voice-cloning systems (ElevenLabs, StyleTTS-V2, XTTS-V2), building on the listener judgments of these stimuli reported in Bakkouche et al. (Finding the human voice in AI: Insights on the perception of AI-voice clones from naturalness and similarity ratings. In Proceedings of Interspeech 2025, 2190–2194. Rotterdam, The Netherlands: ISCA. https://www.isca-archive.org/interspeech_2025/bakkouche25_interspeech.pdf (accessed 14 October 2025)). We analysed pitch, amplitude, speech rate, rhythm, intonation, and speaker-embedding similarity. Overall, ElevenLabs showed the closest correspondence to human speech across several prosodic and speaker-identity measures, although system differences were not uniform across all dimensions. The clearest acoustic differences between models were observed in speech rate, vowel-based rhythm measures, local pitch-control measures, and speaker-embedding similarity. These acoustic findings are consistent with listeners’ perceptual judgments of naturalness and suggest that prosodic timing, rhythm, and fine-grained pitch control are potential correlates of perceived naturalness, and that improvement of these features can contribute to the development of more natural-sounding synthesised speech.
Bakkouche et al. studied this question.