This study presents the first Text-to-Speech (TTS) model for Penang Hokkien, a low-resource tonal dialect at risk of extinction. To address phonological sparsity in the collected speech corpus, we propose a two-stage fine-tuning approach that first establishes comprehensive phonetic coverage through syllable-level synthetic augmentation and then refines prosodic naturalness using real speech recordings. Supplementing a limited 45-minute real speech corpus with a 2-hour syllable-level concatenative synthetic corpus covers the full dialectal inventory of approximately 2,000 unique syllable-tone combinations. Experimental results suggest that improving syllable-tone coverage contributes substantially to intelligibility and tonal accuracy in this low-resource tonal setting. Technical optimizations, including a 600-ms cross-fading technique to mitigate boundary artifacts and numerical tone markers to reduce token sparsity, further improved model stability and synthesis quality. The final model achieved a Mean Opinion Score (MOS) of 3.92.
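To illustrate the cross-fading idea mentioned in the abstract, the following is a minimal sketch of how recorded syllable segments might be joined with an overlapping fade. Only the 600-ms figure comes from the abstract. The function name `crossfade_concat`, the linear fade shape, the 16 kHz sample rate, and the clamping of the overlap to the shorter segment are all assumptions made for illustration, not the authors' implementation.

```python
def crossfade_concat(segments, sample_rate=16000, fade_ms=600):
    """Join audio segments (lists of float samples) with linear cross-fades.

    Hypothetical helper: the fade shape and the default sample rate are
    assumptions; only the 600 ms default comes from the abstract.
    """
    if not segments:
        return []
    fade_len = int(sample_rate * fade_ms / 1000)
    out = list(segments[0])
    for seg in segments[1:]:
        # The overlap cannot be longer than either side of the boundary.
        n = min(fade_len, len(out), len(seg))
        if n == 0:
            out.extend(seg)
            continue
        head = out[:len(out) - n]
        tail = out[len(out) - n:]
        mixed = []
        for i in range(n):
            # Weight of the incoming segment rises linearly across the overlap.
            w = (i + 1) / (n + 1)
            mixed.append(tail[i] * (1.0 - w) + seg[i] * w)
        out = head + mixed + list(seg[n:])
    return out
```

Each boundary shortens the output by the overlap length, so two 10-sample segments joined with a 4-sample fade yield 16 samples. Because the overlap is clamped, a syllable shorter than the requested fade is blended over its whole duration.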
Lai et al. (Thu,) studied this question.