May 21, 2026 | Open Access

Multimodal Attention-Based Framework for Robust Sign Language Recognition Using Deep Learning

Key Points

This work aims to develop a robust sign language recognition system using multimodal deep learning techniques.
Designed a multimodal framework combining RGB video, human pose landmarks, and hand key points.
Utilized Convolutional Neural Networks for spatial feature extraction, Temporal Convolutional Networks for sequence modeling, and Transformer attention mechanisms.
Evaluated on the AUTSL dataset featuring 100 gesture classes.
Achieved strong recognition performance with significantly improved accuracy compared to unimodal methods.
Demonstrated effective feature fusion through adaptive attention-driven mechanisms.
Enabled real-time interaction via a custom Streamlit-based interface.
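
The temporal modeling step named in the key points can be illustrated with a small sketch. The source does not describe the internals of its Temporal Convolutional Network, so the block below assumes one common TCN building block, a causal dilated 1D convolution, and applies it to a toy sequence. The function name `causal_dilated_conv1d`, the kernel values, and the input signal are all hypothetical and chosen only for illustration.

```python
def causal_dilated_conv1d(sequence, kernel, dilation=1):
    """Apply a causal, dilated 1D convolution to a list of floats.

    The output at time t combines sequence[t], sequence[t - dilation],
    sequence[t - 2 * dilation], and so on. Positions before the start
    of the sequence are treated as zero, so no future frame is used.
    """
    out = []
    for t in range(len(sequence)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += weight * sequence[idx]
        out.append(acc)
    return out


# Toy per-frame feature (one scalar per frame); values are made up.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Stacking layers with growing dilation widens the temporal receptive field:
# layer1 sees 2 frames, layer2 sees 4 frames of the original signal.
layer1 = causal_dilated_conv1d(signal, kernel=[0.5, 0.5], dilation=1)
layer2 = causal_dilated_conv1d(layer1, kernel=[0.5, 0.5], dilation=2)
```

Stacking layers with increasing dilation is what lets a convolutional model cover longer gesture segments without a proportional increase in kernel size. Whether the framework described here uses causal padding or these dilation rates is not stated in the source.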

Abstract

Sign language recognition (SLR) has emerged as a crucial technology for enabling communication between hearing-impaired individuals and the wider community. However, many existing methods rely on single-modal inputs such as RGB video or skeletal data, which often struggle to perform reliably under real-world conditions involving occlusions, illumination changes, and complex gesture patterns. This work presents a multimodal deep learning framework that combines RGB visual information, human pose landmarks, and detailed hand key points to effectively capture both spatial structure and temporal motion in sign language gestures. The proposed system integrates Convolutional Neural Networks (CNNs) for spatial feature extraction, Temporal Convolutional Networks (TCNs) for sequence modelling, and Transformer-based attention mechanisms to learn long-range dependencies across frames. In addition, an adaptive attention-driven fusion module is introduced to combine features from multiple modalities dynamically. The model is trained and evaluated on the AUTSL dataset containing 100 gesture classes. Experimental evaluation demonstrates that the proposed approach achieves strong recognition performance and shows clear improvements over unimodal baselines. Furthermore, a Streamlit-based interface is developed to enable real-time interaction and practical usability. Overall, the results highlight the effectiveness of combining multimodal representations with attention mechanisms for building robust and scalable sign language recognition systems.
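
As a rough illustration of what combining modality features dynamically can mean, the sketch below scores each modality's feature vector, turns the scores into softmax weights, and returns the weighted sum. This is a minimal sketch under stated assumptions, not the paper's fusion module: the abstract gives no equations, so the scoring rule (a dot product with a hypothetical learned vector `score_weights`), the function name `attention_fuse`, and the toy feature values are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, score_weights):
    """Fuse per-modality feature vectors with input-dependent weights.

    features:      dict mapping modality name -> feature vector (equal lengths)
    score_weights: hypothetical learned vector used to score each modality
    Returns the fused vector and the attention weight given to each modality.
    """
    names = list(features)
    # One scalar relevance score per modality: dot product with the scoring vector.
    scores = [
        sum(w * x for w, x in zip(score_weights, features[name]))
        for name in names
    ]
    weights = softmax(scores)
    dim = len(score_weights)
    # Weighted sum of the modality vectors, computed dimension by dimension.
    fused = [
        sum(weights[i] * features[name][d] for i, name in enumerate(names))
        for d in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy 3-dimensional features; the values are made up for illustration.
features = {
    "rgb": [0.2, 0.1, 0.4],
    "pose": [0.9, 0.3, 0.1],
    "hand": [0.5, 0.8, 0.6],
}
fused, weights = attention_fuse(features, score_weights=[1.0, 0.5, -0.5])
```

Because the weights depend on the input features, a modality degraded by occlusion or poor lighting can in principle receive less weight for that sample, which is the usual motivation for attention-based fusion over fixed concatenation or averaging.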

Authors

Swapnil Ohol
