September 1, 2024

Articulatory synthesis using representations learnt through phonetic label-aware contrastive loss

Abstract

Articulatory speech synthesis is a challenging task that requires mapping time-varying articulatory trajectories to speech. In recent years, deep learning methods proposed for speech synthesis have made significant progress towards human-like speech generation. However, articulatory speech synthesis remains far from human-level performance. In this work, we aim to improve the quality of articulatory speech synthesis. We consider a deep learning-based sequence-to-sequence baseline and improve upon it with a novel label-aware contrastive learning approach that uses framewise phoneme alignment to learn better representations of the articulatory trajectories. With this approach, we obtain a 5.8% relative improvement in word error rate (WER) over the baseline. We also conduct mean opinion score (MOS) tests and compute objective metrics to further evaluate our proposed models.
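
The abstract does not give the exact form of the label-aware contrastive loss, so the following is only an illustrative sketch. It assumes a supervised-contrastive-style objective in which frames aligned to the same phoneme are treated as positives and all other frames as negatives. The function name `label_aware_contrastive_loss`, the `temperature` parameter, and the toy inputs are hypothetical and are not taken from the paper.

```python
import math


def label_aware_contrastive_loss(frames, labels, temperature=0.1):
    """Assumed SupCon-style loss over frame embeddings with framewise phoneme labels.

    frames: list of equal-length float vectors (one embedding per frame).
    labels: list of phoneme labels, one per frame (from a forced alignment).
    """
    # L2-normalise each embedding so dot products become cosine similarities.
    normed = []
    for vec in frames:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives: other frames that share the anchor's phoneme label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-phoneme frame is skipped

        # Temperature-scaled similarity of the anchor to every other frame.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all other frames.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))

        # Average negative log-probability assigned to the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1

    return total / anchors if anchors else 0.0
```

Under these assumptions, the loss is small when frames with the same phoneme label lie close together in embedding space and large when they are scattered among frames of other phonemes, which is the behaviour a label-aware objective is meant to encourage.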

Cite This Study

Bandekar et al. (2024) studied this question.

synapsesocial.com/papers/68e59e92b6db643587538bbf https://doi.org/10.21437/interspeech.2024-1756