March 18, 2026 · Open Access

Speech-to-Sign Gesture Translation for Kazakh: Dataset and Sign Gesture Translation System

Key Points

The study aims to build a speech-to-sign language translation system for Kazakh Sign Language in a low-resource setting.
Developed a prototype that uses the NVIDIA FastConformer model for automatic speech recognition (ASR).
Created the first Kazakh Sign Language dataset, with 1200 manually recreated signs.
Implemented a multi-stage pipeline that converts speech to text, segments the text into phrases, and matches each phrase to a gesture.
Evaluated performance using word error rate (WER) for ASR and accuracy for speech-to-sign translation.
Achieved an average WER of 10.55% for ASR.
The complete pipeline reached 85% accuracy for individual words and 70% for sentences.
At the phrase level, accuracy was 92.1% for unigrams and 78.3% for trigrams.
The system's average latency was 310 ms.
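
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the authors' evaluation code; the function name and the sample sentence are assumptions made for this example. It also checks that the reported 10.55% average is consistent with an unweighted mean of the 7.8% (isolated words) and 13.3% (full sentences) figures given in the abstract, although the source does not state how the average was computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentence (not taken from the dataset): one of three words is wrong.
example_wer = word_error_rate("мен мектепке барамын", "мен мектеп барамын")

# Unweighted mean of the two WER figures reported in the abstract (in percent).
mean_wer = (7.8 + 13.3) / 2  # approximately 10.55
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.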

Abstract

This paper presents the first prototype of a speech-to-sign language translation system for Kazakh Sign Language (KRSL). The proposed pipeline integrates the NVIDIA FastConformer model for automatic speech recognition (ASR) in the Kazakh language and addresses the challenges of sign language translation in a low-resource setting. Unlike American or British Sign Languages, KRSL lacks publicly available datasets and established translation systems. The pipeline follows a multi-stage process: speech input is converted into text via ASR, segmented into phrases, matched with corresponding gestures, and visualized as sign language. System performance is evaluated using word error rate (WER) for ASR and accuracy metrics for speech-to-sign translation. This study also introduces the first KRSL dataset, consisting of 1200 manually recreated signs, including 95% static images and 5% dynamic gesture videos. To improve robustness under resource-constrained conditions, a Weighted Hybrid Similarity Score (WHSS)-based gesture matching method is proposed. Experimental results show that the FastConformer model achieves an average WER of 10.55%, with 7.8% for isolated words and 13.3% for full sentences. At the phrase level, the system achieves 92.1% accuracy for unigrams, 84.6% for bigrams, and 78.3% for trigrams. The complete pipeline reaches 85% accuracy for individual words and 70% for sentences, with an average latency of 310 ms. These results demonstrate the feasibility and effectiveness of the proposed system for supporting people with hearing and speech impairments in Kazakhstan.
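
The text-to-gesture stages described above (segment the recognized text into phrases, then match each phrase to a sign) can be sketched roughly as follows. This is a hypothetical illustration only: the abstract does not define the WHSS formula, so `hybrid_similarity` below is a stand-in that mixes two generic string similarities with assumed weights, and the greedy longest-phrase-first segmentation, the 0.8 threshold, the `GESTURES` dictionary, and all file names are assumptions rather than the authors' method or data.

```python
from difflib import SequenceMatcher

# Hypothetical gesture dictionary: phrase -> media file (placeholder names).
GESTURES = {
    "қайырлы таң": "good_morning.mp4",
    "рақмет": "thank_you.png",
    "мен": "i.png",
    "су": "water.png",
}


def char_ngrams(text, n=2):
    """Set of character n-grams, with padding so word edges count."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def hybrid_similarity(a, b, w_edit=0.6, w_ngram=0.4):
    """Weighted mix of two similarities (assumed stand-in, not the paper's WHSS)."""
    edit_sim = SequenceMatcher(None, a, b).ratio()
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    jaccard = len(grams_a & grams_b) / len(grams_a | grams_b)
    return w_edit * edit_sim + w_ngram * jaccard


def best_gesture(phrase, threshold=0.8):
    """Return the media file of the closest dictionary phrase, or None."""
    score, key = max((hybrid_similarity(phrase, k), k) for k in GESTURES)
    return GESTURES[key] if score >= threshold else None


def text_to_gestures(text, max_n=3):
    """Greedy segmentation: try trigrams first, then bigrams, then unigrams."""
    words = text.lower().split()
    result = []
    i = 0
    while i < len(words):
        for n in range(min(max_n, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            media = best_gesture(phrase)
            if media is not None:
                result.append((phrase, media))
                i += n
                break
        else:
            # No sign found for this word; a real system might fingerspell it.
            result.append((words[i], None))
            i += 1
    return result


print(text_to_gestures("Қайырлы таң рақмет"))
# [('қайырлы таң', 'good_morning.mp4'), ('рақмет', 'thank_you.png')]
```

Trying longer phrases first lets a multi-word sign take precedence over its individual words, which is one plausible reason for reporting accuracy separately for unigrams, bigrams, and trigrams. The threshold trades missed matches against false ones and would need tuning on real ASR output.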
