We propose an integrated model that synthesizes speech by predicting vocal tract shapes from linguistic features as an intermediate representation, followed by the prediction of acoustic features, using articulatory contours extracted from real-time MRI (rtMRI). Previously, we constructed Model 1, which predicts vocal tract shapes from linguistic features, and Model 2, which predicts acoustic features from the shapes captured by rtMRI. However, when these models were cascaded (Model 1 + 2), the synthesized speech showed substantial quality degradation compared to reference speech re-synthesized using the WORLD vocoder. To overcome this, we integrated both tasks into a single model. Objective evaluation showed that the integrated model predicted vocal tract shapes with accuracy comparable to Model 1 + 2, achieving a root mean square error (RMSE) of 1.14 pixels. For acoustic features, it achieved equal or significantly better accuracy for all features except mel-generalized cepstrum (mgc). A Degradation Mean Opinion Score (DMOS) test with 16 participants showed that the integrated model received a significantly higher score (3.31) than Model 1 + 2. We also analyzed the vocal tract regions contributing to F0 prediction, which was possible from mid-sagittal MRI images. These results demonstrate the effectiveness of the proposed model in producing higher-quality speech than the conventional cascade approach.
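To make the reported contour error concrete, the following is a minimal sketch of how an RMSE in pixels could be computed between a predicted and a reference vocal tract contour. The abstract does not specify the exact definition, so this assumes the error is averaged over all x and y coordinates of matched contour points. The function name `contour_rmse` and the sample coordinates are hypothetical and chosen only for illustration.

```python
import math


def contour_rmse(predicted, reference):
    """RMSE in pixels over all x and y coordinates of matched contour points.

    Assumption: both contours are lists of (x, y) pixel coordinates with
    one-to-one point correspondence. The paper's definition may differ.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("contours must be non-empty and of equal length")
    squared_error = 0.0
    for (px, py), (rx, ry) in zip(predicted, reference):
        squared_error += (px - rx) ** 2 + (py - ry) ** 2
    # Each point contributes two coordinates (x and y) to the mean.
    return math.sqrt(squared_error / (2 * len(predicted)))


# Hypothetical contours, not data from the paper.
reference = [(10.0, 20.0), (12.0, 24.0), (15.0, 27.0)]
predicted = [(11.0, 20.0), (12.0, 23.0), (15.0, 29.0)]
print(round(contour_rmse(predicted, reference), 3))
```

Under this definition, a value of 1.14 would mean the predicted coordinates deviate from the rtMRI-derived coordinates by a little over one pixel in the root-mean-square sense.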
Sumisa Shinkai
Mako Wakita
Hironori Takemoto
The Journal of the Acoustical Society of America
Chiba Institute of Technology
Nara City Hospital
Nara University
Shinkai et al. studied this question.
www.synapsesocial.com/papers/6a05680ea550a87e60a20736 — DOI: https://doi.org/10.1121/10.0041537