This paper presents a methodological framework for building a sign language recognition system that combines a Temporal Convolutional Network (TCN) encoder with a Transformer-based decoder. The proposed approach automatically converts gesture video sequences into text while preserving both temporal dynamics and spatial structure. MediaPipe is used to extract the 3D coordinates of 225 keypoints from each frame, and these features are pre-processed to facilitate efficient model training. The architecture was evaluated experimentally on the Kazakh-Russian Sign Language (KRSL) corpus, where it demonstrated its applicability to practical gesture recognition scenarios. The study addresses three core challenges of sign language recognition: variability among signers, scarcity of training data, and the lack of pretrained models for low-resource languages. Overall, the method advances inclusive communication technologies by making interaction more accessible for people with speech and hearing impairments.
Yerimbetova et al. studied this question.
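The abstract does not specify the pre-processing steps or the TCN configuration, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each frame arrives as a list of (x, y, z) keypoints (225 per frame according to the abstract, which would give 675 features once flattened), that frames are centred and scaled before training, that clips are padded or truncated to a fixed length, and that the encoder uses dilated causal convolutions, the building block commonly associated with TCNs. All function names are hypothetical, and pure Python is used so the sketch runs without any dependencies.

```python
from typing import List, Sequence

Frame = List[float]  # one flattened feature vector per video frame


def normalize_frame(keypoints: Sequence[Sequence[float]]) -> Frame:
    """Centre (x, y, z) keypoints on their centroid, scale so the farthest
    point lies at distance 1, and flatten to a single feature vector.
    Assumption: the paper's actual normalisation may differ."""
    n = len(keypoints)
    centroid = [sum(p[d] for p in keypoints) / n for d in range(3)]
    centered = [[p[d] - centroid[d] for d in range(3)] for p in keypoints]
    # Fall back to 1.0 when all points coincide, to avoid division by zero.
    scale = max(sum(c * c for c in p) ** 0.5 for p in centered) or 1.0
    return [c / scale for p in centered for c in p]


def pad_or_truncate(frames: List[Frame], length: int) -> List[Frame]:
    """Force a clip to a fixed number of frames, zero-padding at the end."""
    dim = len(frames[0])
    clipped = frames[:length]
    return clipped + [[0.0] * dim for _ in range(length - len(clipped))]


def dilated_causal_conv(seq: Sequence[float], kernel: Sequence[float],
                        dilation: int) -> List[float]:
    """1-D causal convolution on a single feature channel: the output at
    time t depends only on inputs at t, t - dilation, t - 2 * dilation, ..."""
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # positions before the clip start count as zero
                acc += w * seq[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of past frames visible to a stack with one convolution per
    dilation level (hypothetical layer layout)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

Under these assumptions, a stack with kernel size 3 and dilations 1, 2, 4, 8 covers 31 frames of context, which illustrates how a TCN encoder can summarise the temporal dynamics of a gesture before a Transformer decoder generates the output text.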