Real-time magnetic resonance imaging (rtMRI) data of the upper airway provides a rich source of information about vocal tract shaping that can inform phonemic analysis and classification. We describe a multimodal phonemic classifier that combines articulatory data with speech audio features to improve performance. A deep network model processes rtMRI video data with ResNet18 and speech audio with a custom CNN, then combines the two data streams in a Transformer layer, whose multi-head self-attention mechanism exploits the correlation between the streams to improve vowel-consonant-vowel classification. The classification accuracy of both the unimodal and multimodal models shows substantial improvement over previous work (> 38%). The addition of audio features improves classification accuracy in the multimodal model by 7% compared with the unimodal model using articulatory data. We analyze the model and discuss the phonetic implications.
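The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the video and audio token arrays are random stand-ins for the outputs of the ResNet18 and custom CNN encoders, all weights are random rather than trained, and the dimensions, the number of classes, and the names `multi_head_self_attention` and `fuse_and_classify` are hypothetical. A full Transformer layer would also include layer normalization and a feed-forward block, which are omitted here to keep the attention mechanism visible.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over one token sequence.

    tokens: (n, d) array. Returns the attended tokens (n, d) and the
    attention weights (num_heads, n, n).
    """
    n, d = tokens.shape
    head_dim = d // num_heads

    def split(x):
        # (n, d) -> (num_heads, n, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    context = weights @ v                       # (num_heads, n, head_dim)
    merged = context.transpose(1, 0, 2).reshape(n, d)
    return merged @ w_o, weights


def fuse_and_classify(video_tokens, audio_tokens, params, num_heads=4):
    # Concatenate both modalities so every token can attend to every other,
    # including tokens from the other stream.
    tokens = np.concatenate([video_tokens, audio_tokens], axis=0)
    attended, weights = multi_head_self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"], params["w_o"],
        num_heads,
    )
    fused = (tokens + attended).mean(axis=0)    # residual connection, mean pool
    return softmax(fused @ params["w_cls"]), weights


# Hypothetical sizes; the real model's dimensions are not given in the abstract.
rng = np.random.default_rng(0)
d_model, num_classes = 32, 5
params = {
    name: rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
    for name in ("w_q", "w_k", "w_v", "w_o")
}
params["w_cls"] = rng.normal(scale=d_model ** -0.5, size=(d_model, num_classes))

video_tokens = rng.normal(size=(8, d_model))    # stand-in for ResNet18 features
audio_tokens = rng.normal(size=(12, d_model))   # stand-in for audio CNN features
probs, attn = fuse_and_classify(video_tokens, audio_tokens, params)
```

In this sketch, `attn[:, :8, 8:]` holds the weights with which video tokens attend to audio tokens, which is where cross-modal correlation would show up in a trained model.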
Yue et al. studied this question.
www.synapsesocial.com/papers/68e59e92b6db643587538a83 — DOI: https://doi.org/10.21437/interspeech.2024-840
Yaoyao Yue
Michael Proctor
Luping Zhou