This research addresses the lack of annotated code-switched (CS) speech data for low-resource languages by proposing a scalable, language-independent system that uses transcribed online videos to collect diverse, noisy speech datasets. The system aims to generate natural-sounding Egyptian Colloquial Arabic (ECA) speech and to improve Automatic Speech Recognition (ASR) performance, particularly on code-switched utterances. We first collected data from transcribed online videos and used it to fine-tune a text-to-speech (TTS) model (XTTSv2) for ECA. We then generated synthetic ECA text segments with Large Language Models (LLMs), using in-context learning so that the segments mirror actual video transcriptions, and converted them into speech utterances with the fine-tuned XTTSv2 model. The resulting synthetic speech-text pairs were used to fine-tune an ASR model (Whisper); code-switched data was incorporated by generating synthetic speech samples in which English words and phrases are embedded within Arabic sentences. Experiments with the Whisper small model show that purely synthetic speech-text pairs reduced the Word Error Rate (WER) by approximately 40% on code-switched transcription tasks. Our data augmentation method thus significantly improves ASR performance for low-resource languages, especially in regions such as the Arab world, where code-switching between Arabic dialects and English is frequent, notably in Egyptian Arabic. The technique provides a valuable resource for developing ASR systems for less commonly spoken languages and dialects.
Morsi et al. studied this question.
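To make the reported metric concrete, the following is a minimal sketch of how Word Error Rate and a relative WER reduction can be computed. It is not the authors' evaluation code: the function names are hypothetical, tokenization is plain whitespace splitting with no text normalization, and it assumes the "approximately 40%" figure is a relative reduction, which the abstract does not state explicitly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace and compared exactly;
    real evaluations usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As a usage illustration with made-up numbers (not results from the paper), a baseline WER of 0.50 that drops to 0.30 after fine-tuning gives `relative_wer_reduction(0.50, 0.30)`, a relative reduction of about 0.40.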