With the rapid development of artificial intelligence and machine learning, automatic speech recognition (ASR) and text-to-speech (TTS) are becoming key components of society's digital transformation. Kazakh, a member of the Turkic language family, is a low-resource language: key components of its language infrastructure, namely audio corpora, language models, and high-quality speech synthesis systems, remain in limited supply. ASR and TTS technologies for Kazakh, which has distinctive phonetic, morphological, and syntactic features, still lag significantly behind their counterparts for widely spoken languages. This study comprehensively evaluates a diverse range of existing speech-to-text (STT) and TTS models and platforms in terms of their applicability and adaptation to the Kazakh language. Particular attention is paid to the linguistic and technical barriers that hinder the effective integration of Kazakh into modern voice technologies, including its agglutinative structure, rich vowel system, and phonemic variability. We tested both open and commercial solutions, including Whisper, GPT-4 Transcribe, ElevenLabs, OpenAI TTS, Voiser, KazakhTTS2, TurkicTTS, and others. Speech recognition was assessed using the WER, TER, chrF, BLEU, and COMET metrics, and speech synthesis was evaluated using MCD, PESQ, STOI, and DNSMOS; together, these metrics cover both lexical-semantic and acoustic-perceptual characteristics. Based on this analysis, we identified the most accurate STT system, a universal model not trained on local data, which demonstrated high accuracy and semantic proximity. For speech synthesis, we identified a model that combines minimal spectral distortion with high subjective sound quality.
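As an illustration of the first recognition metric listed above, word error rate (WER) is the word-level edit distance between a reference transcript and a system hypothesis, divided by the number of reference words. The following is a minimal sketch, not the evaluation code used in the study: the function name `word_error_rate` and the plain whitespace tokenization are assumptions made for this example, and the text normalization applied in the study is not stated.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no case folding or punctuation stripping.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (row-by-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(word_error_rate("мен мектепке барамын", "мен мектепке бардым"))
```

Because Kazakh is agglutinative, a single wrong suffix counts as a full word error under this definition, which is one reason character-level metrics such as chrF are often reported alongside WER.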
Karibayeva et al. (Mon,) studied this question.