What question did this study set out to answer?

This research aims to understand how self-supervised models can explain variation in speech intelligibility between first-language (L1) and second-language (L2) speakers.

May 14, 2026

Speech intelligibility modeling using a pre-trained self-supervised model

Key Points

The study examined speech intelligibility using representations from a pre-trained self-supervised model for sentences produced by 25 L1 and 114 L2 English talkers.
It compared the representational distance of L2 talkers from L1 talkers against several traditional phonetic measures (e.g., vowel space, pitch variability, speech rate) as explanations of intelligibility.
Variation in intelligibility across L2 talkers is better explained by average distance from L1 talkers in the model's representational space than by traditional phonetic measures.
Results indicate substantial potential for self-supervised models to elucidate complex acoustic-phonetic dimensions influencing speech intelligibility.

Abstract

Phonetic theory generally assumes that speech is perceived in terms of acoustic cues, i.e., spectro-temporal properties that map onto discrete linguistic elements. Longer utterances are then presumed to be perceived compositionally, based on combined processing of acoustic cues associated with the sequence of linguistic elements that comprises the utterance. While traditional cue-based approaches have identified acoustic sources of talker-related intelligibility variation, a large portion of variation in both first-language (L1) and second-language (L2) speech intelligibility remains unexplained, especially for sentence-length utterances that involve multiple interacting acoustic dimensions and contextual dependencies. Self-supervised learning models, which are not restricted to purely compositional encoding, may provide novel insight into variation in intelligibility. We tested this by examining pre-trained self-supervised model representations of sentences produced by L1 (n = 25) versus L2 (n = 114) English talkers. We found that variation in intelligibility across L2 talkers is better explained by average distance from L1 talkers in the representational space than by traditional phonetic measures (e.g., vowel space, pitch variability, speech rate). This suggests that pre-trained self-supervised models hold substantial promise for breakthroughs in our understanding of the multitude of acoustic-phonetic dimensions that underlie speech variation and intelligibility. Work supported by NSF DRL grant 2219843.
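
As a rough illustration of the "average distance from L1 talkers in the representational space" described above, the sketch below computes that quantity for toy vectors. The abstract does not specify the model, the pooling of sentence representations into one vector per talker, or the distance metric, so cosine distance, the one-vector-per-talker simplification, and the names `cosine_distance` and `mean_distance_to_l1` are all assumptions made for this example, not the study's actual procedure.

```python
import math
from statistics import mean


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_distance_to_l1(l2_vector, l1_vectors):
    """Average distance from one L2 talker's vector to every L1 talker's vector."""
    return mean(cosine_distance(l2_vector, v) for v in l1_vectors)


# Toy 3-dimensional vectors standing in for pooled model representations.
# Real representations would come from a pre-trained model (assumed, not shown).
l1_talkers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
l2_talkers = {
    "talker_a": [1.0, 1.0, 0.0],  # lies between the two L1 vectors
    "talker_b": [0.0, 0.0, 1.0],  # orthogonal to both L1 vectors
}

# One score per L2 talker: larger means farther from the L1 group on average.
distances = {
    name: mean_distance_to_l1(vec, l1_talkers)
    for name, vec in l2_talkers.items()
}
```

In this toy setup `talker_b` ends up farther from the L1 group than `talker_a`. In an analysis of the kind the abstract describes, per-talker scores like these would then be related to measured intelligibility and compared against the traditional phonetic measures.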
