End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations. Voice conversion, in which a model has to impersonate a speaker in a recording, is one of those situations. In this paper, we propose Blow, a single-scale normalizing flow using hypernetwork conditioning to perform many-to-many voice conversion between raw audio. Blow is trained end-to-end, with non-parallel data, on a frame-by-frame basis using a single speaker identifier. We show that Blow compares favorably to existing flow-based architectures and other competitive baselines, obtaining equal or better performance in both objective and subjective evaluations. We further assess the impact of its main components with an ablation study, and quantify a number of properties such as the necessary amount of training data or the preference for source or target speakers.
Serrà et al. (Mon,) studied this question.