Sign language recognition (SLR) has emerged as a crucial technology for enabling communication between hearing-impaired individuals and the wider community. However, many existing methods rely on single-modal inputs such as RGB video or skeletal data, which often struggle to perform reliably under real-world conditions involving occlusions, illumination changes, and complex gesture patterns.

This work presents a multimodal deep learning framework that combines RGB visual information, human pose landmarks, and detailed hand key points to effectively capture both spatial structure and temporal motion in sign language gestures. The proposed system integrates Convolutional Neural Networks (CNNs) for spatial feature extraction, Temporal Convolutional Networks (TCNs) for sequence modelling, and Transformer-based attention mechanisms to learn long-range dependencies across frames. In addition, an adaptive attention-driven fusion module is introduced to combine features from multiple modalities dynamically. The model is trained and evaluated on the AUTSL dataset containing 100 gesture classes. Experimental evaluation demonstrates that the proposed approach achieves strong recognition performance and shows clear improvements over unimodal baselines. Furthermore, a Streamlit-based interface is developed to enable real-time interaction and practical usability. Overall, the results highlight the effectiveness of combining multimodal representations with attention mechanisms for building robust and scalable sign language recognition systems.
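The abstract does not specify how the attention-driven fusion module is built, so the following is only a minimal sketch of one common way such a module could weight modalities dynamically: each modality's feature vector is scored against a query vector, the scores are normalised with a softmax, and the fused feature is the weighted sum. The function name `attention_fusion`, the `query` vector (standing in for a learned parameter), and the toy feature values are all assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(features, query):
    """Fuse per-modality feature vectors with attention weights.

    features: dict mapping a modality name to a feature vector (equal lengths).
    query:    hypothetical stand-in for a learned scoring vector.
    Returns the fused vector and the per-modality attention weights.
    """
    names = list(features)
    dim = len(query)
    # Scaled dot-product score of each modality against the query.
    scores = [
        sum(f * q for f, q in zip(features[name], query)) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of the modality vectors, computed per dimension.
    fused = [
        sum(w * features[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy per-frame features for the three modalities named in the abstract.
features = {
    "rgb": [0.2, 0.9, 0.1, 0.4],
    "pose": [0.7, 0.1, 0.3, 0.5],
    "hands": [0.6, 0.8, 0.2, 0.9],
}
query = [0.5, 0.5, 0.5, 0.5]

fused, weights = attention_fusion(features, query)
print(weights)
print(fused)
```

In a trained model the query (or a small scoring network) would be learned, so a modality degraded by occlusion or poor lighting could receive a lower weight for that input; here the weights depend only on the toy values above.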
Swapnil Ohol
Swapnil Ohol (Tue,) studied this question.
synapsesocial.com/papers/6a0ea196be05d6e3efb605ce — DOI: https://doi.org/10.5281/zenodo.20286329