July 30, 2025Open Access

Speech Recognition and Synthesis Models and Platforms for the Kazakh Language

Puntos clave

The analysis identifies significant gaps in the language infrastructure for the Kazakh language in ASR and TTS technologies.
Existing ASR and TTS systems struggle to accommodate unique phonetic and morphological features of the Kazakh language.
Testing encompassed both open-source and commercial solutions, examining their effectiveness in speech recognition and synthesis.
A model was found that achieves high accuracy in speech recognition without being trained on local data, reflecting promising outcomes.

Resumen

With the rapid development of artificial intelligence and machine learning technologies, automatic speech recognition (ASR) and Text-to-Speech (TTS) are becoming key components of the digital transformation of society. The Kazakh language, a representative of the Turkic language family, is a low-resource language. The analysis shows significant limitations in the availability of key components of the language infrastructure, namely audio corpora, language models, and high-quality speech synthesis systems. However, for the Kazakh language, which has unique phonetic, morphological, and syntactic features, the level of development of ASR/TTS technologies still lags significantly behind their counterparts for widely spoken languages. This study aims to comprehensively analyze existing speech recognition and text-to-speech models and platforms, emphasizing their applicability and adaptation to the Kazakh language. Particular attention is paid to the linguistic and technical barriers that hinder the effective integration of the Kazakh language into modern voice technologies, including the agglutinative structure, rich vowel system, and phonemic variability. This study aims to comprehensively evaluate a diverse range of existing Speech-to-Text (STT) and TTS models and platforms in terms of their applicability to the Kazakh language. We have tested both open and commercial solutions, including Whisper, GPT-4 Transcribe, ElevenLabs, OpenAI TTS, Voiser, KazakhTTS2, TurkicTTS, and others. The assessment of speech recognition was based on the WER, TER, chrF, BLEU, and COMET metrics. In contrast, speech synthesis was evaluated using MCD, PESQ, STOI, and DNSMOS, which cover both lexical-semantic and acoustic-perceptual characteristics. Based on the analysis, it has selected the most accurate universal STT system model, not trained on local data, which demonstrated high accuracy and semantic proximity. In the field, we have identified a model that combines minimal spectral distortions with high subjective sound quality.

Leer artículo completoexternamente

Me gusta

Guardar

Ver artículo completo

Cite This Study

Karibayeva et al. (Mon,) studied this question.

synapsesocial.com/papers/689a0945e6551bb0af8cefd4 https://doi.org/https://doi.org/10.20944/preprints202507.2282.v1

Also Consider

Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context:

Me gusta

Guardar

Ver artículo completo