Los puntos clave no están disponibles para este artículo en este momento.
Synthesis of Text-to-speech (TTS) is a process that involves translating a natural language text into a speech. Speech synthesisers face a major challenge when recognizing the prosodic elements of written text, such as intonation (the rise and fall of the voice in speaking), and length. In contrast, continuous speech features are influenced by the personality and emotions of the artist. A database is maintained to store the synthesized speech pieces. Its output is determined by how similar the person utters the words and how capable they are of being implied. In the past few years, the field of text-to-speech synthesis has been heavily impacted by the emergence of deep learning, an AI technology that has gained widespread popularity. This review paper presents a taxonomy of models and architectures that are based on deep learning and discusses the various datasets that are utilised in the TTS process. It also covers the evaluation matrices that are commonly used. The paper ends with a look at the future directions of the system and reaches to some Deep learning models that give promising results in this field.
Ahmad et al. (Sat,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: