September 6, 2015

Learning the speech front-end with raw waveform CLDNNs

Tara N. Sainath (Massachusetts Institute of Technology), Ron J. Weiss (Massachusetts Institute of Technology), Andrew Senior (Google, United States)

Abstract

Learning an acoustic model directly from the raw waveform has been an active area of research. However, waveform-based models have not yet matched the performance of log-mel trained neural networks. We will show that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech. Specifically, we will show the benefit of the CLDNN, namely the time convolution layer in reducing temporal variations, the frequency convolution layer for preserving locality and reducing frequency variations, as well as the LSTM layers for temporal modeling. In addition, by stacking raw waveform features with log-mel features, we achieve a 3% relative reduction in word error rate.
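To make the role of the time convolution layer concrete, here is a minimal pure-Python sketch of how such a layer could reduce temporal variation. The abstract states only that the layer has this effect; the rectification, the max pooling over the whole window, the log compression with a 0.01 offset, the toy filter taps, and the names `time_convolution_frontend` and `pulse_at` are all assumptions made for illustration, not the authors' implementation. The frequency convolution and LSTM layers are not sketched.

```python
import math


def time_convolution_frontend(window, filters):
    """Filter a raw-waveform window in time, then pool over time.

    window:  list of float samples for one analysis window.
    filters: list of equal-length lists of float filter taps
             (learned in a real model, fixed toy values here).
    Returns one log-compressed value per filter.
    """
    features = []
    for taps in filters:
        n_out = len(window) - len(taps) + 1  # "valid" output length
        # Slide the filter along the waveform, one response per time shift.
        responses = [
            sum(t * window[i + k] for k, t in enumerate(taps))
            for i in range(n_out)
        ]
        # Assumed: rectify, then max-pool over the whole window, so the
        # output does not depend on where in the window the pattern sits.
        pooled = max(max(r, 0.0) for r in responses)
        # Assumed: log compression with a small offset to keep it finite.
        features.append(math.log(pooled + 0.01))
    return features


def pulse_at(position, length=16):
    """Hypothetical helper: a window that is zero except for one sample."""
    window = [0.0] * length
    window[position] = 1.0
    return window


toy_filters = [[1.0, 0.5], [-1.0, 1.0]]  # illustrative taps only
early = time_convolution_frontend(pulse_at(4), toy_filters)
late = time_convolution_frontend(pulse_at(9), toy_filters)
print(early == late)
```

In this sketch the same pulse placed at two different positions in the window yields identical features, which is one simple reading of "reducing temporal variations".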

Cite This Study

Sainath et al. (2015) studied this question.

synapsesocial.com/papers/69fb895c6d730ca589dd5ba0 https://doi.org/10.21437/interspeech.2015-1
