May 14, 2026

L1 influence on stability in speech foundation model-based articulatory mapping of L2 English speech

Key Points

This study asks how first language (L1) phonological systems influence the stability of articulatory trajectories inferred from acoustics in second language (L2) English speech.
Used an acoustic-to-articulatory inversion (AAI) system built on a speech foundation model (WavLM-large), pretrained on 94,000 hours of English audio and further trained on electromagnetic articulography data from a native English speaker.
Evaluated AAI stability with round-trip resynthesis, comparing inferred articulatory trajectories before and after resynthesis on two corpora: L2-ARCTIC and CMU ARCTIC.
Hypothesized that inversion is more stable for speakers whose L1 rhythmic structure and segmental inventory resemble those of English.
Speakers of Germanic languages tend to show more stable inversion than speakers of syllable-timed, tonal, or laryngeally complex languages.
L1 backgrounds that diverge more from English show greater trajectory mismatch after resynthesis.
Findings point to L1-driven articulatory biases in AAI and to a need for typologically informed approaches to articulatory supervision.

Abstract

This study investigates how first language (L1) phonological systems affect the stability of acoustic-to-articulatory inversion (AAI) in second language (L2) English speech using a speech foundation model-based approach. We leverage an AAI system built on WavLM-large, pretrained on 94 000 h of English audio from diverse domains and further trained to predict articulatory trajectories using electromagnetic articulography data from a native English speaker. This supervision enables the model to approximate vocal tract movements but encodes English L1 articulatory priors, limiting generalization to diverse L2 backgrounds. We hypothesize that speakers of languages with rhythmic structures and segmental inventories similar to English will exhibit more stable AAI, while speakers of more divergent L1s will show greater trajectory mismatch. Inversion performance was evaluated on two publicly available corpora (L2-ARCTIC and CMU ARCTIC) with a round-trip resynthesis procedure that compares inferred articulatory trajectories before and after resynthesis. Results show systematic variation across L1s. Speakers of Germanic languages (English varieties, German) tend to yield more stable inversion, while speakers of syllable-timed (Spanish, Korean), tonal (Mandarin, Vietnamese), or laryngeally complex (Arabic, Hebrew, Indian varieties) languages show greater mismatch. Our findings offer evidence of L1-driven articulatory biases, highlighting the need for typologically informed approaches to articulatory supervision. Work supported by IARPA.
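
The round-trip comparison described in the abstract can be illustrated with a small sketch. The abstract does not name the mismatch metric, so the per-channel Pearson correlation used here is an assumption, as are the function names `pearson` and `round_trip_stability` and the premise that the two trajectories are already time-aligned and equal in length. The inversion and resynthesis models themselves are outside the sketch; it only scores two trajectories that such a pipeline would produce.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant channel has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def round_trip_stability(
    before: List[List[float]], after: List[List[float]]
) -> float:
    """Mean per-channel correlation between two articulatory trajectories.

    `before` holds the trajectory inferred from the original audio and
    `after` the trajectory inferred from the resynthesized audio. Each is
    a list of frames; each frame is a list of channel values.
    Hypothetical metric: the source does not specify how mismatch is scored.
    """
    if len(before) != len(after):
        raise ValueError("trajectories must have the same number of frames")
    n_channels = len(before[0])
    scores = [
        pearson([frame[c] for frame in before], [frame[c] for frame in after])
        for c in range(n_channels)
    ]
    return sum(scores) / len(scores)


# Toy trajectories: 3 frames, 2 articulatory channels (made-up values).
toy_before = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0]]
toy_inverted = [[-v for v in frame] for frame in toy_before]
```

Under this assumed metric, a value near 1 would indicate that the trajectory survives the round trip largely unchanged, while lower values would indicate greater mismatch of the kind the abstract reports for L1s that diverge from English.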
