Speech-to-text translation (S2TT) for low-resource languages remains challenging due to the scarcity of parallel speech translation data and the susceptibility of modular pipelines to error propagation. This paper presents a controlled empirical comparison of cascade and end-to-end approaches for Kazakh–Russian speech translation using the ST-kk-ru dataset (≈332 h, 140 k triplets). The cascade framework is strengthened with recent pre-trained models for automatic speech recognition and neural machine translation, achieving 21.3 BLEU on the test set. Three representative end-to-end architectures are evaluated under identical data conditions. The strongest direct model, combining a Wav2Vec 2.0 encoder with an mBART decoder augmented by a length adaptor and adapter modules, reaches 17.97 BLEU, compared with 15.35 BLEU for FAIRSEQ S2T and 16.3 BLEU for ESPnet-ST. Automatic evaluation is complemented by expert manual assessment and targeted linguistic analysis. Results indicate that, under current low-resource conditions, cascade systems provide higher translation accuracy and better morpho-syntactic fidelity, while end-to-end models remain competitive and offer advantages in architectural simplicity. End-to-end models may also reduce inference latency through single-pass processing, although latency was not measured in this study. This study establishes a reproducible benchmark for Kazakh–Russian speech translation and highlights practical trade-offs between modeling paradigms in low-resource, morphologically rich settings.
Zhanibek Kozhirbayev studied this question.