Communication is a fundamental aspect of human interaction, essential for expressing emotions and building relationships. While individuals with typical hearing rely on spoken language, members of the deaf and mute community communicate through sign language, a visual language of hand gestures and facial expressions. However, communication barriers persist between hearing and non-hearing individuals, especially in regions with limited access to assistive technologies. To address this gap, we developed a real-time sign language recognition system that converts Arabic sign gestures into text. Unlike most existing systems, which are limited to individual letters or numbers, our model recognizes complete, meaningful words. It was trained on a curated set of 112 Arabic sign language words drawn from the KARSL dataset. Using OpenCV and the MediaPipe framework, we extracted multimodal keypoints from the hands, face, and upper-body pose. MediaPipe Hands generated a 255-dimensional feature vector for each video frame, capturing hand movements in real time. These per-frame features were used to train four deep learning models: CNN, GRU, LSTM, and Bi-LSTM. Among these, the Bi-LSTM model performed best, with a training accuracy of 99.89% and a testing accuracy of 99.61%. These results highlight the potential of MediaPipe-based landmark extraction combined with deep learning to support accessible communication for Arabic-speaking deaf communities.
Alshanik et al. (Fri,) studied this question.
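The per-frame feature construction described in the abstract can be illustrated with a minimal sketch. The abstract gives only the total of 255 features per frame and does not say how they are divided among hand, face, and pose keypoints, so the split below is an assumption made purely for illustration: 21 landmarks per hand and 33 pose landmarks (the counts MediaPipe reports for those models) plus a hypothetical selection of 10 face points, each with x, y, z coordinates. The names `GROUP_SIZES`, `frame_to_vector`, and `video_to_sequence`, and the zero-filling of missing detections, are likewise hypothetical and not taken from the paper.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Landmark = Tuple[float, float, float]  # (x, y, z) coordinates of one keypoint

# ASSUMPTION: this split is illustrative only. The abstract states the 255
# total but not its breakdown; the 10 face points are a hypothetical choice.
GROUP_SIZES: Dict[str, int] = {
    "left_hand": 21,
    "right_hand": 21,
    "pose": 33,
    "face": 10,
}
FEATURE_DIM = 3 * sum(GROUP_SIZES.values())  # 85 landmarks * 3 coordinates


def frame_to_vector(
    detections: Dict[str, Optional[Sequence[Landmark]]]
) -> List[float]:
    """Flatten one frame's landmarks into a fixed-length feature vector.

    Groups that are absent or None (e.g. a hand out of view) are
    zero-filled so every frame has the same dimensionality.
    """
    vector: List[float] = []
    for name, count in GROUP_SIZES.items():
        points = detections.get(name)
        if points is None:
            vector.extend([0.0] * (3 * count))
            continue
        if len(points) != count:
            raise ValueError(
                f"{name}: expected {count} landmarks, got {len(points)}"
            )
        for x, y, z in points:
            vector.extend((x, y, z))
    return vector


def video_to_sequence(
    frames: Sequence[Dict[str, Optional[Sequence[Landmark]]]],
    num_frames: int,
) -> List[List[float]]:
    """Truncate or zero-pad a video to exactly num_frames feature vectors."""
    vectors = [frame_to_vector(frame) for frame in frames[:num_frames]]
    while len(vectors) < num_frames:
        vectors.append([0.0] * FEATURE_DIM)
    return vectors
```

A sequence built this way has shape (num_frames, 255) and could be passed to a recurrent classifier such as the Bi-LSTM mentioned above. In practice the landmark lists would come from MediaPipe's detectors rather than being built by hand.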