December 1, 2014

Voice conversion using deep neural networks with speaker-independent pre-training

Abstract

In this study, we trained a deep autoencoder to build compact representations of short-term spectra of multiple speakers. Using this compact representation as mapping features, we then trained an artificial neural network to predict target voice features from source voice features. Finally, we constructed a deep neural network from the trained deep autoencoder and artificial neural network weights, which were then fine-tuned using back-propagation. We compared the proposed method to existing methods using Gaussian mixture models and frame-selection. We evaluated the methods objectively, and also conducted perceptual experiments to measure both the conversion accuracy and speech quality of selected systems. The results showed that, for 70 training sentences, frame-selection performed best, regarding both accuracy and quality. When using only two training sentences, the pre-trained deep neural network performed best, regarding both accuracy and quality.
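As a rough illustration of the construction step described above, the following sketch stacks an autoencoder's encoder, a mapping network, and the autoencoder's decoder into one deep network. It is not the authors' implementation: the layer sizes, the tanh activations, and the names `layer`, `forward`, `encoder`, `mapper`, and `decoder` are all assumptions, and random weights stand in for trained ones. Training and back-propagation fine-tuning are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out, nonlinear):
    """Create one fully connected layer; random weights stand in for trained ones."""
    W = rng.normal(0.0, 0.1, size=(n_in, n_out))
    b = np.zeros(n_out)
    return (W, b, nonlinear)

def forward(layers, x):
    """Run a batch of frames (rows of x) through a list of layers."""
    for W, b, nonlinear in layers:
        x = x @ W + b
        if nonlinear:
            x = np.tanh(x)
    return x

# Hypothetical dimensions: 40-dim spectral frames, 16-dim compact code.
SPEC_DIM, HIDDEN, CODE = 40, 64, 16

# Speaker-independent autoencoder, assumed already trained on many speakers.
encoder = [layer(SPEC_DIM, HIDDEN, True), layer(HIDDEN, CODE, True)]
decoder = [layer(CODE, HIDDEN, True), layer(HIDDEN, SPEC_DIM, False)]

# Mapping network, assumed trained to predict target codes from source codes.
mapper = [layer(CODE, CODE, True), layer(CODE, CODE, True)]

# The deep network is the concatenation of the three parts. In the paper,
# this stacked network would then be fine-tuned with back-propagation.
dnn = encoder + mapper + decoder

source_frames = rng.normal(size=(5, SPEC_DIM))
step_by_step = forward(decoder, forward(mapper, forward(encoder, source_frames)))
end_to_end = forward(dnn, source_frames)
```

Because the stacked network starts from the same weights as its parts, `end_to_end` matches `step_by_step` before any fine-tuning; fine-tuning would then adjust all six layers jointly on the parallel source and target data.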
