Large-scale autoregressive text-to-speech (TTS) models can generate speech that is nearly indistinguishable from human speech. However, training large language models (LLMs) is challenging due to memory and computational constraints. This paper describes our TTS method for the 2024 Conversational Voice Clone Challenge (CoVoC). Our approach modifies the LauraGPT model to synthesize mixed Chinese and English text by expanding the Chinese pinyin vocabulary and reducing the number of layers in the decoder-only Transformer architecture. Despite using minimal training data, the performance gap between our method and other constrained systems is relatively small in both subjective and some objective evaluations. This paper discusses our attempt to train lightweight LLMs for zero-shot TTS and analyzes the factors contributing to low performance. Our audio samples can be accessed online at https://axunyi.github.io/lwllmtts/.
Wu et al. (Thu,) studied this question.