Second-language (L2) speech deviates from L1 speech along multiple acoustic-phonetic dimensions. Free classification tasks have shown that L1 listeners judge similarities (or differences) among L2 speech samples based on, for instance, a talker’s gender, degree of foreign accent, or perceived L1 background. However, the acoustic-phonetic dimensions underlying these judgments remain poorly understood. We examined the perceptual organization of L2 speech using a self-supervised machine learning model trained on a large corpus of L1 speech. Sentence recordings of 63 L2 English talkers from 5 L1 backgrounds (11–14 talkers/L1, 118–120 sentences/talker) were transformed into multi-dimensional representations by this pre-trained model. Average inter-talker similarity within this multi-dimensional space was significantly related to L2 intelligibility and L1 background. Specifically, pairs of L2 English talkers with higher L2 intelligibility or a shared L1 background were represented as more phonetically similar in the space than pairs of L2 talkers with lower intelligibility or different L1 backgrounds. Our investigation thus proposes a novel way of studying the cognitive representations of speech. Applying machine-learning techniques to represent and classify speech samples in a pre-trained, self-supervised, high-dimensional representation space opens the possibility of breakthroughs in our understanding of the multitude of acoustic-phonetic dimensions that underlie speech variation.
Kim et al. studied this question.
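The notion of average inter-talker similarity described above can be illustrated with a minimal sketch. The abstract does not specify the model, the pooling step, or the similarity measure, so everything here is an assumption made for illustration: each sentence is taken to be already reduced to one fixed-length vector, similarity is taken to be cosine similarity averaged over all cross-talker sentence pairs, and the embeddings are synthetic random data rather than real model output. The function names `talker_similarity_matrix` and `mean_similarity_by_shared_l1` are hypothetical.

```python
from __future__ import annotations

import numpy as np


def talker_similarity_matrix(embeddings: dict[str, np.ndarray]) -> tuple[list[str], np.ndarray]:
    """Average cosine similarity between every pair of talkers.

    `embeddings` maps a talker id to an array of shape (n_sentences, dim),
    one row per sentence. This layout is an assumption of the sketch.
    """
    talkers = sorted(embeddings)
    # Scale each sentence vector to unit length so dot products are cosines.
    normed = {
        t: e / np.linalg.norm(e, axis=1, keepdims=True)
        for t, e in embeddings.items()
    }
    sim = np.zeros((len(talkers), len(talkers)))
    for i, a in enumerate(talkers):
        for j, b in enumerate(talkers):
            # Cosine similarity for all sentence pairs (a, b), then averaged.
            sim[i, j] = float((normed[a] @ normed[b].T).mean())
    return talkers, sim


def mean_similarity_by_shared_l1(
    talkers: list[str], sim: np.ndarray, l1_of: dict[str, str]
) -> tuple[float, float]:
    """Mean similarity of same-L1 talker pairs versus different-L1 pairs."""
    same, diff = [], []
    for i in range(len(talkers)):
        for j in range(i + 1, len(talkers)):
            bucket = same if l1_of[talkers[i]] == l1_of[talkers[j]] else diff
            bucket.append(sim[i, j])
    return float(np.mean(same)), float(np.mean(diff))


# Synthetic stand-in for model output: two hypothetical L1 groups whose
# talkers cluster around different directions of a 16-dimensional space.
rng = np.random.default_rng(0)
dim, n_sentences = 16, 5
centroids = {"L1_A": np.eye(dim)[0] * 10.0, "L1_B": np.eye(dim)[1] * 10.0}
l1_of = {"t1": "L1_A", "t2": "L1_A", "t3": "L1_B", "t4": "L1_B"}
embeddings = {
    t: centroids[l1] + rng.normal(size=(n_sentences, dim))
    for t, l1 in l1_of.items()
}

talkers, sim = talker_similarity_matrix(embeddings)
same, diff = mean_similarity_by_shared_l1(talkers, sim, l1_of)
print(f"same-L1 mean similarity: {same:.3f}, different-L1: {diff:.3f}")
```

In this toy setup the same-L1 pairs come out more similar than the different-L1 pairs only because the synthetic data were constructed that way; the sketch shows the shape of the computation, not the reported result or the authors' actual pipeline.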