Articulatory speech synthesis is a challenging task that requires mapping time-varying articulatory trajectories to speech. In recent years, deep learning methods for speech synthesis have made significant progress towards human-like speech generation. However, articulatory speech synthesis remains far from human-level performance. In this work, we therefore aim to improve its synthesis quality. We start from a deep learning-based sequence-to-sequence baseline and improve upon it with a novel label-aware contrastive learning approach that uses framewise phoneme alignment to learn better representations of the articulatory trajectories. With this approach, we obtain a relative improvement in Word Error Rate (WER) of 5.8% over the baseline. We also conduct mean opinion score (MOS) tests and report other objective metrics to further evaluate our proposed models.
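To make the idea of label-aware contrastive learning over framewise phoneme labels more concrete, the following is a minimal sketch in plain Python. It assumes the generic supervised contrastive formulation, in which frames sharing a phoneme label are treated as positives and all other frames as negatives; the abstract does not specify the exact loss, so this should not be read as the authors' implementation. The function name `label_aware_contrastive_loss`, the temperature value, and the toy embeddings and labels are all hypothetical.

```python
import math


def label_aware_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised-contrastive-style loss over frames (illustrative assumption).

    embeddings: list of per-frame feature vectors (lists of floats).
    labels: list of per-frame phoneme labels, e.g. from a forced alignment.
    Frames with the same label are pulled together, others pushed apart.
    """
    # L2-normalise each frame embedding so dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, num_anchors = 0.0, 0
    for i in range(n):
        # Positives: other frames carrying the same phoneme label as the anchor.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing

        # Temperature-scaled similarities to every other frame.
        logits = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        num_anchors += 1

    return total / num_anchors if num_anchors else 0.0


# Toy usage: four 2-D "frame embeddings" with hypothetical phoneme labels.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
well_separated = label_aware_contrastive_loss(frames, ["a", "a", "i", "i"])
mixed_up = label_aware_contrastive_loss(frames, ["a", "i", "a", "i"])
```

In this toy setting the loss is lower when frames with the same phoneme label lie close together in embedding space, which is the property such a term would encourage when added to a synthesis objective.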
Bandekar et al. studied this question.