Scaling text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective method for improving the diversity and naturalness of synthesized speech. At a high level, previous large-scale TTS models can be categorized as either autoregressive (AR) models (e.g., VALL-E) or non-autoregressive (NAR) models (e.g., NaturalSpeech 2/3). Although these works demonstrate good performance, they still have potential weaknesses. For instance, AR-based models are plagued by unstable generation quality and slow generation speed; meanwhile, some NAR-based models need phoneme-level duration alignment information, which increases the complexity of data pre-processing, model design, and loss design. In this work, we build upon our previous publication by implementing a simple and efficient NAR TTS framework, termed SimpleSpeech 2. SimpleSpeech 2 effectively combines the strengths of both AR and NAR methods, offering the following key advantages: (1) simplified data preparation; (2) straightforward model and loss design; and (3) stable, high-quality generation performance with fast inference speed. Compared to our previous publication, we present (1) a detailed analysis of the influence of the speech tokenizer and noisy labels on TTS performance; (2) four distinct types of sentence duration predictors; and (3) a novel flow-based scalar latent transformer diffusion model. With these improvements, we show a significant gain in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on multilingual speech datasets. Demos are available at: https://dongchaoyang.top/SimpleSpeech2_demo/.
Dongchao Yang
Chinese University of Hong Kong
Rongjie Huang
Guangxi Medical University
Yuanyuan Wang
Northeast Agricultural University
Yang et al. studied this question.
synapsesocial.com/papers/68e5b027b6db643587549d53 — DOI: https://doi.org/10.48550/arxiv.2408.13893