A central challenge in speech recognition research is limited training data, especially for low-resource languages. A common strategy for improving model performance is to enlarge the training set through data augmentation. Data augmentation has proven effective at increasing the amount of training data and reducing the mismatch between training and test conditions. It also benefits deep neural networks by mitigating overfitting and improving generalization. This study compares the impact of several standard augmentation techniques (time stretching, pitch shifting, noise addition, and gain), applied to low-resource dialectal speech, on recognition performance with a Speech-Transformer architecture. The dataset consists of Indonesian dialectal speech. The results indicate average improvements of 57.6%, 57.9%, and 59.3% in Character Error Rate (CER), Word Error Rate (WER), and Sentence Error Rate (SER), respectively, compared with speech recognition without any data augmentation.
Endah et al. (Mon,) studied this question.
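For readers unfamiliar with the three metrics named above, the following is a minimal sketch of how CER, WER, and SER are conventionally computed from reference and hypothesis transcripts. It is not the authors' evaluation code: the function names `edit_distance` and `error_rates` are hypothetical, the example sentences are made up for illustration, and the sketch assumes whitespace tokenization, counts spaces as characters for CER, and pools errors over the whole corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (substitutions, insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rates(references, hypotheses):
    """Corpus-level CER, WER, and SER as fractions.

    Assumptions (not taken from the source): words are split on
    whitespace, spaces count as characters for CER, and a sentence
    is an error if its word sequence differs from the reference.
    """
    char_errs = char_total = 0
    word_errs = word_total = 0
    sent_errs = 0
    for ref, hyp in zip(references, hypotheses):
        char_errs += edit_distance(list(ref), list(hyp))
        char_total += len(ref)
        ref_words, hyp_words = ref.split(), hyp.split()
        word_errs += edit_distance(ref_words, hyp_words)
        word_total += len(ref_words)
        sent_errs += int(ref_words != hyp_words)
    return {
        "CER": char_errs / char_total,
        "WER": word_errs / word_total,
        "SER": sent_errs / len(references),
    }


# Illustrative (made-up) transcripts: the second hypothesis drops one letter.
refs = ["saya makan nasi", "dia pergi"]
hyps = ["saya makan nasi", "dia pegi"]
rates = error_rates(refs, hyps)
```

Lower values are better for all three metrics. Because SER counts a sentence as wrong when any word differs, it is always at least as strict as WER on the same data.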