Speech-to-Text (STT) systems, despite their stellar performance in recent years, still struggle with recognising non-Western English accents and speech that features Code-Switching (CS), a linguistic phenomenon common in regions such as Nigeria. This study addresses that challenge for Nigerian English and Yoruba-English code-switched speech by adapting Mozilla’s DeepSpeech 0.9.3 model and fine-tuning it using a custom dataset of 118 minutes (approximately 1.97 hours). This process involved transfer learning and hyperparameter optimisation over iterative training sessions on a CPU-based setup. The model’s performance was evaluated using Word Error Rate (WER) and Character Error Rate (CER), with the best model showing modest improvements over the baseline model and achieving a WER of 0.760261 and CER of 0.381241 after 55 epochs. Although limited computing resources and the small dataset imposed significant constraints on the work, the study demonstrated the potential of fine-tuning and transfer learning for model adaptation to low-resource languages and code-switching contexts. Future work will require access to GPU resources for improved convergence and transcription accuracy, an expanded dataset and support for Yoruba diacritics to improve the quality of transcriptions.
Olorunshola et al. (Thu,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: