With advances in deep learning and related techniques, synthetic speech is getting closer to a natural-sounding voice. Some state-of-the-art technologies achieve such a high level of naturalness that even humans have difficulty distinguishing real speech from computer-generated speech. Moreover, these technologies allow a person to train a speech synthesizer on a target voice, creating a model that can reproduce someone’s voice with high fidelity. In this paper, we introduce the FoR Dataset, which contains more than 198,000 utterances from the latest deep-learning speech synthesizers as well as real speech. This dataset can be used as a base for studies in speech synthesis and synthetic speech detection. Because of its large number of utterances, it is well suited to machine learning studies: it is large enough to train even complex deep learning models without overfitting. We present several experiments using this dataset, including a deep learning classifier that reached up to 99.96% accuracy in synthetic speech detection.
Reimao et al. studied this question.