August 14, 2025Open Access

Investigation of ASR Models for Low-Resource Kazakh Child Speech: Corpus Development, Model Adaptation, and Evaluation

Key Points

ESPnet achieved the highest accuracy at 32% for sentences, while Vosk excelled with 68% accuracy on short words.
The study evaluated four ASR models, with whisper and espnet showing notable performance improvements through corpus adaptation.
Data were collected using a custom Telegram bot and Dictaphone recordings, focusing on children aged 2 to 8 years.
Findings highlight the need for specialized acoustic corpora in low-resource environments supporting child speech technology.

Abstract

This study focuses on the development and evaluation of automatic speech recognition (ASR) systems for Kazakh child speech, an underexplored domain in both linguistic and computational research. A specialized acoustic corpus was constructed for children aged 2 to 8 years, incorporating age-related vocabulary stratification and gender variation to capture phonetic and prosodic diversity. The data were collected from three sources: a custom-designed Telegram bot, high-quality Dictaphone recordings, and naturalistic speech samples recorded in home and preschool environments. Four ASR models, Whisper, DeepSpeech, ESPnet, and Vosk, were evaluated. Whisper, ESPnet, and DeepSpeech were fine-tuned on the curated corpus, while Vosk was applied in its standard pretrained configuration. Performance was measured using five evaluation metrics: Word Error Rate (WER), BLEU, Translation Edit Rate (TER), Character Similarity Rate (CSRF2), and Accuracy. The results indicate that ESPnet achieved the highest accuracy (32%) and the lowest WER (0.242) for sentences, while Whisper performed well in semantically rich utterances (Accuracy = 33%; WER = 0.416). Vosk demonstrated the best performance on short words (Accuracy = 68%) and yielded the highest BLEU score (0.600) for short words. DeepSpeech showed moderate improvements in accuracy, particularly for short words (Accuracy = 60%), but faced challenges with longer utterances, achieving an Accuracy of 25% for sentences. These findings emphasize the critical importance of age-appropriate corpora and domain-specific adaptation when developing ASR systems for low-resource child speech, particularly in educational and therapeutic contexts.

Investigation of ASR Models for Low-Resource Kazakh Child Speech: Corpus Development, Model Adaptation, and Evaluation

Key Points

Abstract

Cite This Study