What question did this study set out to answer?

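The study asks whether a single integrated model, which predicts vocal tract shapes from linguistic features and then predicts acoustic features from those shapes, can synthesize higher-quality speech than a cascade of two separate models built on real-time MRI (rtMRI) data.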

May 14, 2026

Speech synthesis via vocal tract shape extracted from real-time MRI

Key Points

The study targets the quality degradation that appears when a vocal tract shape model and an acoustic feature model are cascaded.
The authors developed an integrated model that predicts vocal tract shapes from linguistic features as an intermediate representation and then predicts acoustic features.
They compared it against the conventional cascade approach (Model 1 + 2) using objective metrics such as root mean square error (RMSE).
They ran a Degradation Mean Opinion Score (DMOS) test with 16 participants to assess speech quality.
The integrated model achieved an RMSE of 1.14 pixels in predicting vocal tract shapes, comparable to the cascaded Model 1 + 2.
For acoustic features, the integrated model was equally or significantly more accurate on every feature except the mel-generalized cepstrum (mgc).
In the DMOS test, the integrated model scored 3.31, significantly higher than the cascaded Model 1 + 2.

Abstract

We propose an integrated model that synthesizes speech by predicting vocal tract shapes from linguistic features as an intermediate representation, followed by the prediction of acoustic features, using articulatory contours extracted from real-time MRI (rtMRI). Previously, we constructed Model 1, which predicts vocal tract shapes from linguistic features, and Model 2, which predicts acoustic features from the shapes captured by rtMRI. However, when these models were cascaded (Model 1 + 2), the synthesized speech showed substantial quality degradation compared to reference speech re-synthesized using the WORLD vocoder. To overcome this, we integrated both tasks into a single model. Objective evaluation showed that the integrated model predicted vocal tract shapes with accuracy comparable to Model 1 + 2, achieving a root mean square error (RMSE) of 1.14 pixels. For acoustic features, it achieved equal or significantly better accuracy for all features except mel-generalized cepstrum (mgc). A Degradation Mean Opinion Score (DMOS) test with 16 participants showed that the integrated model received a significantly higher score (3.31) than Model 1 + 2. We also analyzed the vocal tract regions contributing to F0 prediction, which was possible from mid-sagittal MRI images. These results demonstrate the effectiveness of the proposed model in producing higher-quality speech than the conventional cascade approach.
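
As a rough illustration of the shape-accuracy metric reported above, the sketch below computes an RMSE in pixels between a predicted and a reference contour. It assumes each contour is a list of (x, y) points in image pixel coordinates and that the error is the point-wise Euclidean distance; the abstract does not specify the exact formulation, so the function name `rmse_pixels` and this definition are illustrative assumptions, not the authors' implementation.

```python
import math


def rmse_pixels(predicted, reference):
    """RMSE between two contours given as equally long lists of (x, y) pixels.

    Assumption: the error per point is the Euclidean distance between the
    predicted and the reference point; the paper may define it differently.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("contours must be non-empty and of equal length")
    # Squared Euclidean distance for each matched pair of contour points.
    squared = [
        (px - rx) ** 2 + (py - ry) ** 2
        for (px, py), (rx, ry) in zip(predicted, reference)
    ]
    # Mean of the squared distances, then the square root.
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical contours, for illustration only (not data from the study).
example_predicted = [(10.0, 20.0), (11.0, 22.0), (13.0, 25.0)]
example_reference = [(10.0, 21.0), (12.0, 22.0), (13.0, 24.0)]
example_error = rmse_pixels(example_predicted, example_reference)
```

Under this definition, a value such as 1.14 pixels would mean that predicted contour points lie, in the root-mean-square sense, a little over one pixel from their reference positions.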

Cite This Study

Shinkai et al. (Wed,) studied this question.

synapsesocial.com/papers/6a05680ea550a87e60a20736 https://doi.org/10.1121/10.0041537