This study focuses on the development and evaluation of automatic speech recognition (ASR) systems for Kazakh child speech, an underexplored domain in both linguistic and computational research. A specialized acoustic corpus was constructed for children aged 2 to 8 years, incorporating age-related vocabulary stratification and gender variation to capture phonetic and prosodic diversity. The data were collected from three sources: a custom-designed Telegram bot, high-quality Dictaphone recordings, and naturalistic speech samples recorded in home and preschool environments. Four ASR models, Whisper, DeepSpeech, ESPnet, and Vosk, were evaluated. Whisper, ESPnet, and DeepSpeech were fine-tuned on the curated corpus, while Vosk was applied in its standard pretrained configuration. Performance was measured using five evaluation metrics: Word Error Rate (WER), BLEU, Translation Edit Rate (TER), Character Similarity Rate (CSRF2), and Accuracy. The results indicate that ESPnet achieved the highest accuracy (32%) and the lowest WER (0.242) for sentences, while Whisper performed well in semantically rich utterances (Accuracy = 33%; WER = 0.416). Vosk demonstrated the best performance on short words (Accuracy = 68%) and yielded the highest BLEU score (0.600) for short words. DeepSpeech showed moderate improvements in accuracy, particularly for short words (Accuracy = 60%), but faced challenges with longer utterances, achieving an Accuracy of 25% for sentences. These findings emphasize the critical importance of age-appropriate corpora and domain-specific adaptation when developing ASR systems for low-resource child speech, particularly in educational and therapeutic contexts.
Rakhimova et al. (Thu,) studied this question.