What does this research mean for the field?

Fine-tuning an Automatic Speech Recognition model (Whisper small) on synthetic speech-text pairs, with text generated by Large Language Models and audio produced by a fine-tuned text-to-speech system, reduced Word Error Rate by approximately 40% on code-switched transcription tasks, demonstrated for Egyptian Arabic. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This study asks how to improve automatic speech recognition (ASR) for low-resource languages, particularly where annotated code-switched speech data is scarce.

June 6, 2026 | Open Access

Enhancing multilingual automatic speech recognition for low-resource code-switched languages: a scalable data augmentation strategy

Mohab Mostafa Morsi; Radwa Fathalla (Arab Academy for Science, Technology, and Maritime Transport); Sherif Abdou (Cairo University)

Key Points

Addressed the lack of annotated code-switched speech data for low-resource languages.
Proposed a scalable, language-independent system that collects diverse, noisy speech data from transcribed online videos.
Fine-tuned a text-to-speech model (XTTSv2) on the collected video data to produce natural-sounding Egyptian Colloquial Arabic (ECA) speech.
Generated synthetic ECA text segments with large language models through in-context learning, then converted them to speech with the fine-tuned XTTSv2 model.
Used the synthetic speech-text pairs to fine-tune an ASR model (Whisper), including samples with English words and phrases embedded in Arabic sentences.
Reduced Word Error Rate (WER) by approximately 40% on code-switched transcription tasks with the Whisper small model, using purely synthetic speech-text pairs.
Indicated that the method supports languages with frequent code-switching, such as Egyptian Arabic.
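
The headline result is a relative reduction in Word Error Rate, so it may help to see how that figure is usually computed. The sketch below is a minimal, dependency-free illustration: WER is the word-level edit distance divided by the number of reference words, and the relative reduction compares a baseline WER with an improved one. The function names and the example numbers are illustrative assumptions, not taken from the paper, and the paper's own scoring (including any text normalization) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not results from the paper.
    print(word_error_rate("a b c d", "a x c"))
    print(relative_wer_reduction(0.50, 0.30))
```

Note that a 40% relative reduction is not the same as a 40-point absolute drop: under the hypothetical numbers above, WER moves from 0.50 to 0.30.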

Abstract

This research addresses the lack of annotated code-switched (CS) speech data for low-resource languages by proposing a scalable, language-independent system that utilizes transcribed online videos to collect diverse, noisy speech datasets. The proposed system aims to generate natural-sounding Egyptian Colloquial Arabic (ECA) speech and improve Automatic Speech Recognition (ASR) performance, particularly for code-switched utterances. To achieve this, we collected data from transcribed online videos to fine-tune a text-to-speech (TTS) model (XTTSv2) for ECA. Synthetic ECA text segments were generated using Large Language Models (LLMs) through in-context learning to mirror actual video transcriptions. These synthetic text segments were then converted into speech utterances using the fine-tuned XTTSv2 model. The resulting synthetic speech-text pairs were used to fine-tune an ASR model (Whisper), with code-switched data incorporated by generating synthetic speech samples with embedded English words and phrases within Arabic sentences. Experimental results on the Whisper ASR small model show that purely synthetic speech-text pairs reduced the Word Error Rate (WER) by approximately 40% on code-switched transcription tasks. In conclusion, our data augmentation method significantly improves ASR performance for low-resource languages, especially in regions like the Arab world where code-switching between Arabic dialects and English is frequent, particularly in Egyptian Arabic. This technique provides a valuable resource for developing ASR systems for less commonly spoken languages and dialects.
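
The text-generation step described in the abstract (in-context learning from real transcriptions, with English embedded in Arabic sentences) can be sketched roughly as follows. This is only an illustration under stated assumptions: the prompt wording, the helper names `build_prompt`, `is_code_switched`, and `english_ratio`, and the idea of checking generated segments by script are all hypothetical and are not described in the paper. The example sentence is made up. No LLM or TTS call is shown; the sketch only prepares a few-shot prompt and checks whether a segment mixes Arabic-script and Latin-script words.

```python
import re

# Words written in the basic Unicode Arabic block (U+0600 to U+06FF).
ARABIC_WORD = re.compile(r"[\u0600-\u06FF]+")
# Words written in unaccented Latin letters, used here as a proxy for English.
LATIN_WORD = re.compile(r"[A-Za-z]+")


def build_prompt(examples: list[str], n_new: int = 5) -> str:
    """Assemble a few-shot prompt from real transcribed segments (hypothetical wording)."""
    shots = "\n".join(f"- {example}" for example in examples)
    return (
        "Here are transcribed Egyptian Arabic segments that mix in English words:\n"
        f"{shots}\n"
        f"Write {n_new} new segments in the same style, one per line."
    )


def is_code_switched(segment: str) -> bool:
    """True if the segment contains both Arabic-script and Latin-script words."""
    return bool(ARABIC_WORD.search(segment)) and bool(LATIN_WORD.search(segment))


def english_ratio(segment: str) -> float:
    """Share of Latin-script words among all Arabic-script and Latin-script words."""
    arabic = len(ARABIC_WORD.findall(segment))
    latin = len(LATIN_WORD.findall(segment))
    total = arabic + latin
    return latin / total if total else 0.0


if __name__ == "__main__":
    sample = "أنا عندي meeting بكرة"  # made-up example segment
    print(build_prompt([sample], n_new=3))
    print(is_code_switched(sample), english_ratio(sample))
```

A script-based check like this is deliberately crude: it treats any Latin-script word as English and would miss English words transliterated into Arabic script, so it is best read as a sanity filter rather than a language identifier.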
