What question did this study set out to answer?

This research aims to explore the effects of automatic prosodic segmentation on dataset construction for text-to-speech (TTS) systems in Brazilian Portuguese.

February 5, 2026 (Open Access)

Investigating the effect of automatic prosodic segmentation on speech synthesis for Brazilian Portuguese

Key Points

Analyzed the CORAA NURC-SP Minimal Corpus, a dataset of approximately 17.5 hours (≈ 17h35m) of spontaneous speech.
Compared three segmentation conditions: manual prosodic segmentation, WhisperX automatic segmentation, and a machine learning prosodic segmentation method.
Trained a speech synthesis model using FastSpeech2 with the segmented datasets.
Speech synthesized from automatically segmented data closely resembles manually segmented speech in F0 curve representation.
Synthetic speech shows increased variability in tonal events and prosodic focus compared to natural speech.
70% of synthesized nuclear contours differed from those in natural speech, indicating limitations of automatic segmentation.
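
As an illustration of what comparing F0 curves can involve, the following is a minimal sketch that measures the distance between two F0 contours in semitones. It is not the measure used in the study, which this summary does not specify. The function names, the 100 Hz reference, and the convention that 0 Hz marks an unvoiced frame are all assumptions made for the example, and the two contours are assumed to be already time-aligned and of equal length.

```python
import math


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert a frequency in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def f0_rmse_semitones(f0_a, f0_b, ref_hz=100.0):
    """Root-mean-square F0 difference in semitones between two aligned contours.

    Assumption: both lists hold one F0 value per frame, are already
    time-aligned, and use 0 to mark unvoiced frames. Only frames that
    are voiced in both contours contribute to the result.
    """
    if len(f0_a) != len(f0_b):
        raise ValueError("contours must have the same number of frames")
    diffs = [
        hz_to_semitones(a, ref_hz) - hz_to_semitones(b, ref_hz)
        for a, b in zip(f0_a, f0_b)
        if a > 0 and b > 0
    ]
    if not diffs:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical contours (Hz): the second one is an octave higher on voiced frames.
natural_f0 = [100.0, 0.0, 120.0]
synthetic_f0 = [200.0, 150.0, 240.0]
distance = f0_rmse_semitones(natural_f0, synthetic_f0)
```

A semitone scale is used here because it makes differences comparable across speakers with different pitch ranges; a distance close to zero would mean the two contours follow nearly the same F0 trajectory.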

Abstract

This paper has been accepted for presentation at Speech Prosody 2026.

Although automatic methods of prosodic segmentation have recently been proposed, their effect on the construction of datasets for TTS training is still unknown. For the first time in Brazilian Portuguese, we investigated this type of effect on the CORAA NURC-SP Minimal Corpus dataset, consisting of ≈ 17h35m of spontaneous speech, which was segmented using an automatic prosodic segmenter to train a speech synthesis model. We comparatively analyzed natural speech and speech synthesized by FastSpeech2 under three segmentation conditions: manual prosodic segmentation, WhisperX automatic segmentation, and a machine learning prosodic segmentation method. The results of the acoustic prosodic analysis revealed that speech synthesized from a dataset with automatic prosodic segmentation approximates speech generated with manually segmented data, considering the representation of the F0 curve. Nevertheless, in a phonological analysis, synthetic speech exhibited a higher variability in tonal events and prosodic focus, as was also observed by Hu et al. (2024) for Southern British English. Furthermore, 70% of synthesized nuclear contours differed from the nuclear contours of natural speech. We attribute these issues, among other factors, to the fact that automatic segmentation, unlike manual segmentation, does not systematically capture the pauses and F0 variations that delimit intonational units.
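
To make the nuclear contour figure concrete, the sketch below shows one way such a percentage could be computed: pair each natural utterance with its synthesized counterpart and count the pairs whose nuclear contour labels differ. The function name and the ToBI-style labels are hypothetical and chosen only for illustration; the abstract does not describe the annotation scheme or the exact counting procedure.

```python
def contour_mismatch_rate(natural_labels, synthetic_labels):
    """Fraction of paired utterances whose nuclear contour labels differ.

    Assumption: the two lists are parallel, so index i in each list
    refers to the same utterance (natural vs. synthesized version).
    """
    if len(natural_labels) != len(synthetic_labels):
        raise ValueError("label lists must be the same length")
    if not natural_labels:
        raise ValueError("at least one utterance pair is required")
    mismatches = sum(
        1 for nat, syn in zip(natural_labels, synthetic_labels) if nat != syn
    )
    return mismatches / len(natural_labels)


# Hypothetical annotations for ten utterance pairs (labels are illustrative only).
natural = ["H+L* L%"] * 10
synthetic = ["H+L* L%"] * 3 + ["L+H* L%"] * 4 + ["L* H%"] * 3
rate = contour_mismatch_rate(natural, synthetic)
print(f"{rate:.0%} of nuclear contours differ")
```

With these made-up labels, seven of the ten pairs differ, which corresponds to the kind of proportion reported in the abstract.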

Cite This Study

Galdino et al. studied this question.

synapsesocial.com/papers/698434ebf1d9ada3c1fb3999 https://doi.org/10.5281/zenodo.18457768