September 13, 2019

RawNet: Advanced End-to-End Deep Neural Network Using Raw Waveforms for Text-Independent Speaker Verification

Abstract

Recently, direct modeling of raw waveforms using deep neural networks has been widely studied for a number of tasks in audio domains. In speaker verification, however, utilization of raw waveforms is in its preliminary phase, requiring further investigation. In this study, we explore end-to-end deep neural networks that input raw waveforms to improve various aspects: front-end speaker embedding extraction including model architecture, pre-training scheme, additional objective functions, and back-end classification. Adjustment of model architecture using a pre-training scheme can extract speaker embeddings, giving a significant improvement in performance. Additional objective functions simplify the process of extracting speaker embeddings by merging conventional two-phase processes: extracting utterance-level features such as i-vectors or x-vectors and the feature enhancement phase, e.g., linear discriminant analysis. Effective back-end classification models that suit the proposed speaker embedding are also explored. We propose an end-to-end system that comprises two deep neural networks, one front-end for utterance-level speaker embedding extraction and the other for back-end classification. Experiments conducted on the VoxCeleb1 dataset demonstrate that the proposed model achieves state-of-the-art performance among systems without data augmentation. The proposed system is also comparable to the state-of-the-art x-vector system that adopts data augmentation.
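The two-stage structure described in the abstract (a front-end that turns a raw waveform into one utterance-level embedding, followed by a back-end that scores a pair of embeddings) can be illustrated with a minimal sketch. This is not the paper's architecture: the function names, the toy strided convolution with mean pooling, the `stride` and `threshold` values, and the cosine-similarity scorer are all assumptions chosen for illustration. In particular, cosine similarity is swapped in as a simple baseline where the paper uses a second neural network for back-end classification.

```python
import math


def frontend_embedding(waveform, filters, stride):
    """Toy front-end (hypothetical): strided 1-D convolution + ReLU over raw
    samples, then mean pooling over time, giving one fixed-length vector
    per utterance regardless of its duration."""
    width = len(filters[0])
    sums = [0.0] * len(filters)
    frames = 0
    for start in range(0, len(waveform) - width + 1, stride):
        window = waveform[start:start + width]
        for k, filt in enumerate(filters):
            activation = sum(w * x for w, x in zip(filt, window))
            sums[k] += max(0.0, activation)  # ReLU
        frames += 1
    if frames == 0:
        raise ValueError("waveform is shorter than the filter width")
    return [s / frames for s in sums]


def cosine_score(emb_a, emb_b):
    """Baseline back-end (assumption): cosine similarity of two embeddings."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify(wave_enroll, wave_test, filters, stride=4, threshold=0.5):
    """Score an enrollment/test pair and apply a decision threshold.
    The threshold value is illustrative, not taken from the paper."""
    emb_enroll = frontend_embedding(wave_enroll, filters, stride)
    emb_test = frontend_embedding(wave_test, filters, stride)
    score = cosine_score(emb_enroll, emb_test)
    return score, score >= threshold
```

Calling `verify(enroll_samples, test_samples, filters)` with two lists of audio samples and a list of equal-length filter weight lists returns the similarity score and an accept/reject decision. In a trained system the filters would be learned rather than supplied by hand.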

Cite This Study

Jung et al. (2019) studied this question.

synapsesocial.com/papers/6a1bff07c97d63156a5f31f4 https://doi.org/10.21437/interspeech.2019-1982
