As the line between human speakers and "AI-generated" voices becomes increasingly blurred, it is important to understand how sociolinguistic knowledge affects human-computer interaction. Human listeners have been shown to rely on real-world biases, along with acoustic cues and their social associations, to characterize AI-synthesized voices, but it is often unclear if or how these factors interact. We examined these issues by conducting a production and perception study on OpenAI's Whisper-generated voices. Listeners heard each of the generated voices and rated them for perceived demographic features and personality traits. We find that particular voices are consistently associated with specific combinations of age, race/ethnicity, gender, and personality traits; we also find that ratings differ by listener demographics. Acoustic analysis indicates that the voices differ in properties such as subharmonic-to-harmonic ratio, H1-H2, mean f0, and intonational contours. Altogether, we find that listeners from various backgrounds converge on meaningful, imagined personae for synthesized voices, and that prosodic features may influence how listeners arrive at these judgments. Human listeners readily ascribe real-world social characteristics to synthesized voices, demonstrating the importance of human experience in human-computer interaction and the deep entrenchment of social judgment in all kinds of communication, even with non-human actors.
Fleisig et al. studied this question.