Children's automatic speech recognition (ASR) poses a significant challenge due to the highly variable nature of children's speech. The limited availability of training datasets hampers effective modelling of this variability, a problem that can be partially addressed by using a text-to-speech (TTS) system for data augmentation. However, generated data may contain imperfections that can degrade performance. In this work, we use adapters to handle the domain mismatch when fine-tuning with TTS data. This involves a two-step training process: first, adapter layers are trained on synthetic data while the pre-trained model is kept frozen; then, both the adapters and the entire model are fine-tuned on a mix of synthetic and real data, where only synthetic data passes through the adapters. Experimental results demonstrate up to a 6% relative reduction in word error rate (WER) compared to the straightforward use of synthetic data, indicating the effectiveness of adapter-based architectures in learning from imperfect synthetic data.
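The routing and the two-step schedule described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the residual bottleneck form of the adapter, its placement after a single layer, and all names (`BottleneckAdapter`, `AdaptedLayer`, `trainable_groups`) are assumptions made for illustration only, and no actual gradient training is shown.

```python
from dataclasses import dataclass
from typing import List, Set

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


@dataclass
class BottleneckAdapter:
    """Assumed adapter form: x + W_up @ relu(W_down @ x)."""
    w_down: Matrix
    w_up: Matrix

    def __call__(self, x: Vector) -> Vector:
        hidden = [max(0.0, h) for h in matvec(self.w_down, x)]
        update = matvec(self.w_up, hidden)
        return [xi + ui for xi, ui in zip(x, update)]


@dataclass
class AdaptedLayer:
    """A (hypothetical) pre-trained layer followed by an adapter."""
    w_base: Matrix
    adapter: BottleneckAdapter

    def forward(self, x: Vector, is_synthetic: bool) -> Vector:
        h = matvec(self.w_base, x)
        # Only synthetic (TTS) inputs pass through the adapter;
        # real speech bypasses it.
        return self.adapter(h) if is_synthetic else h


def trainable_groups(stage: int) -> Set[str]:
    """Parameter groups updated in each stage of the two-step schedule."""
    if stage == 1:  # frozen pre-trained model, synthetic data only
        return {"adapter"}
    if stage == 2:  # adapters and model, mixed synthetic and real data
        return {"adapter", "base"}
    raise ValueError("stage must be 1 or 2")
```

With an identity base layer and a one-unit bottleneck, a real utterance's features leave the layer unchanged while a synthetic utterance's features receive the adapter's residual update. In a real system the same `is_synthetic` flag would be carried per utterance in each mixed batch.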
Thomas Rolland
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento
Alberto Abad
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento
Rolland et al. studied this question.
synapsesocial.com/papers/68e7398bb6db6435876b2fbd — DOI: https://doi.org/10.1109/icassp48485.2024.10446889