What question did this study set out to answer?

The study aims to develop a real-time system for recognizing Arabic sign language gestures and converting them to text.

May 12, 2026 | Open Access

Arabic Sign Language Recognition with Deep Learning Models and Keypoint Landmarks

Key Points

The study aims to develop a real-time system for recognizing Arabic sign language gestures and converting them to text.
Utilized OpenCV and MediaPipe for keypoint extraction from hands, face, and upper body.
Developed deep learning models including CNN, GRU, LSTM, and Bi-LSTM using a dataset of 112 Arabic sign language words.
The Bi-LSTM model outperformed the other models, reaching a training accuracy of 99.89% and a testing accuracy of 99.61%.
The real-time recognition system successfully converts Arabic sign gestures into text.
The results demonstrate the effectiveness of MediaPipe-based features in supporting communication for Arabic-speaking deaf communities.
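
To illustrate what "bidirectional" means in the Bi-LSTM named above, the sketch below reads a sequence of frame vectors once forward and once backward, then concatenates the two summaries. It is not the study's model: for brevity it swaps the gated LSTM cell for a plain tanh recurrence, and the helper names, layer sizes, and random weights are all hypothetical.

```python
import math
import random


def rnn_pass(frames, w_in, w_rec):
    """Run a plain tanh recurrence over the frames and return the last hidden state.

    This stands in for an LSTM cell only to keep the sketch short.
    """
    hidden = [0.0] * len(w_rec)
    for frame in frames:
        new_hidden = []
        for j in range(len(hidden)):
            total = sum(w_in[j][i] * frame[i] for i in range(len(frame)))
            total += sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
    return hidden


def bidirectional_summary(frames, fwd_weights, bwd_weights):
    """Concatenate a forward-in-time summary with a backward-in-time summary."""
    forward = rnn_pass(frames, *fwd_weights)
    backward = rnn_pass(list(reversed(frames)), *bwd_weights)
    return forward + backward


def random_weights(rng, hidden_size, feature_size):
    """Hypothetical untrained weights: (input matrix, recurrent matrix)."""
    w_in = [[rng.uniform(-0.1, 0.1) for _ in range(feature_size)]
            for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    return w_in, w_rec


rng = random.Random(0)
fwd = random_weights(rng, hidden_size=4, feature_size=3)
bwd = random_weights(rng, hidden_size=4, feature_size=3)
clip = [[0.1, 0.2, 0.3], [0.0, 0.5, -0.2], [0.4, -0.1, 0.0]]  # toy frames
summary = bidirectional_summary(clip, fwd, bwd)  # 8 values: 4 forward + 4 backward
```

In a trained classifier, a summary like this would feed a final layer that scores each of the 112 words. Reading the clip in both directions lets that summary reflect both the start and the end of a gesture.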

Abstract

Communication is a fundamental aspect of human interaction, essential for expressing emotions and building relationships. While individuals with typical hearing rely on spoken language, the deaf and mute community communicates through visual gestures and facial expressions, commonly known as sign language. However, communication barriers persist between hearing and non-hearing individuals, especially in regions with limited assistive technologies. To address this gap, we developed a real-time sign language system that converts Arabic sign gestures into textual output. Unlike most existing systems that are limited to individual alphabets or numbers, our model recognizes complete, meaningful words. It was trained on a curated dataset of 112 Arabic sign language words extracted from the KARSL dataset. Using OpenCV and the MediaPipe framework, multimodal keypoints from hands, face, and upper-body pose were extracted. MediaPipe Hands generated a 255-dimensional feature vector for each video frame, capturing real-time hand movements. These features were used to train deep learning models—CNN, GRU, LSTM, and Bi-LSTM. Among these, the Bi-LSTM model achieved the highest performance with a training accuracy of 99.89% and testing accuracy of 99.61%. These results emphasize the potential of MediaPipe-based landmark extraction combined with deep learning to support accessible communication for Arabic-speaking deaf communities.
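
As a rough illustration of how per-frame keypoints could become model input, the sketch below flattens (x, y, z) landmark tuples into a fixed-size vector and pads or truncates a clip to a fixed number of frames. Only the 255-dimensional size comes from the abstract. The function names, the zero-filling of missing landmarks, and the 30-frame clip length are assumptions, and the abstract does not say how the 255 values are split across hands, face, and pose.

```python
FEATURE_DIM = 255  # per-frame vector size reported in the abstract


def frame_to_vector(landmarks, feature_dim=FEATURE_DIM):
    """Flatten (x, y, z) landmark tuples into one fixed-size vector.

    Assumption: landmarks that were not detected are simply absent from the
    input, and the remaining slots are filled with zeros.
    """
    flat = [coord for point in landmarks for coord in point]
    if len(flat) > feature_dim:
        raise ValueError("more coordinates than the feature vector can hold")
    return flat + [0.0] * (feature_dim - len(flat))


def to_fixed_length(frames, num_frames, feature_dim=FEATURE_DIM):
    """Truncate or zero-pad a list of frame vectors to exactly num_frames."""
    clipped = frames[:num_frames]
    padding = [[0.0] * feature_dim for _ in range(num_frames - len(clipped))]
    return clipped + padding


# Toy clip: two frames with two detected landmarks each (hypothetical values).
detected = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]
clip = [frame_to_vector(detected), frame_to_vector(detected)]
sequence = to_fixed_length(clip, num_frames=30)  # 30 is an assumed clip length
```

Each resulting sequence has the same shape (frames by features), which is what recurrent models such as GRU, LSTM, and Bi-LSTM typically expect when clips are batched together.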
