Phonetic theory generally assumes that speech is perceived in terms of acoustic cues, i.e., spectro-temporal properties that map onto discrete linguistic elements. Longer utterances are then presumed to be perceived compositionally, based on combined processing of acoustic cues associated with the sequence of linguistic elements that comprises the utterance. While traditional cue-based approaches have identified acoustic sources of talker-related intelligibility variation, a large portion of variation in both first-language (L1) and second-language (L2) speech intelligibility remains unexplained, especially for sentence-length utterances that involve multiple interacting acoustic dimensions and contextual dependencies. Self-supervised learning models, which are not restricted to purely compositional encoding, may provide novel insight into variation in intelligibility. We tested this by examining pre-trained self-supervised model representations of sentences produced by L1 (n = 25) versus L2 (n = 114) English talkers. We found that variation in intelligibility across L2 talkers is better explained by average distance from L1 talkers in the representational space than by traditional phonetic measures (e.g., vowel space, pitch variability, speech rate). This suggests that pre-trained self-supervised models hold substantial promise for breakthroughs in our understanding of the multitude of acoustic-phonetic dimensions that underlie speech variation and intelligibility. Work supported by NSF DRL grant 2219843.
Bradlow et al. (Wed,) studied this question.
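The distance measure described in the abstract, an L2 talker's average distance from L1 talkers in the model's representational space, can be illustrated with a minimal sketch. The abstract does not specify the model, the pooling of frame-level representations into one vector per talker, or the distance metric, so everything here is an assumption for illustration: each talker is reduced to a single fixed-length vector, cosine distance is the metric, and the names `cosine_distance` and `mean_distance_to_l1` and the toy 3-dimensional vectors are hypothetical.

```python
import math
from statistics import mean


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_distance_to_l1(l2_embedding, l1_embeddings):
    """Average distance from one L2 talker's vector to every L1 talker's vector.

    Assumption: each talker is summarized by one pooled embedding. The
    abstract does not say how representations were pooled or compared.
    """
    return mean(cosine_distance(l2_embedding, ref) for ref in l1_embeddings)


# Toy stand-ins for pooled model representations (not real data).
l1_refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
near_talker = [1.0, 1.0, 0.0]  # lies between the two L1 reference vectors
far_talker = [0.0, 0.0, 1.0]   # orthogonal to both L1 reference vectors

near_score = mean_distance_to_l1(near_talker, l1_refs)
far_score = mean_distance_to_l1(far_talker, l1_refs)
print(near_score < far_score)
```

Under these assumptions, a per-talker score of this kind could then be compared against intelligibility scores alongside the traditional phonetic measures (vowel space, pitch variability, speech rate) named in the abstract.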