June 2, 2026 | Open Access

A textual-guided compact visual attention features for continuous sign language recognition

Key Points

This research aims to improve continuous sign language recognition by enhancing video feature extraction and alignment with textual glosses.
Developed a model incorporating Cross-Modal Attention (CMA) and Compact Bilinear Pooling (CBP) modules.
Utilized datasets including RWTH-PHOENIX, Chinese Sign Language, and Doordarshan Continuous Indian Sign Language (DDCISL).
Employed Connectionist Temporal Classification (CTC) to align the predicted glosses into coherent sentences.
Reported improved performance over recent techniques on these CSLR datasets.
Demonstrated that aligning visual features with textual glosses benefits recognition.
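
To make the CTC step concrete, the sketch below shows best-path (greedy) CTC decoding: take the top-scoring label per frame, merge consecutive repeats, then drop the blank symbol. This is only an illustration of the general CTC collapse rule, not the paper's implementation; the function name `ctc_greedy_decode`, the gloss labels, and the scores are hypothetical, and the paper may use a different decoding strategy such as beam search.

```python
def ctc_greedy_decode(frame_scores, vocab, blank=0):
    """Best-path CTC decoding (illustrative; not the paper's code).

    frame_scores: one list of per-label scores for each video frame.
    vocab: label index -> gloss string; index `blank` is the CTC blank.
    """
    # 1. Pick the highest-scoring label index for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    # 2. Merge consecutive repeats, then 3. drop blanks.
    glosses = []
    previous = None
    for idx in best_path:
        if idx != previous and idx != blank:
            glosses.append(vocab[idx])
        previous = idx
    return glosses


# Hypothetical three-label vocabulary and per-frame scores.
vocab = ["<blank>", "TOMORROW", "RAIN"]
frame_scores = [
    [0.10, 0.80, 0.10],  # TOMORROW
    [0.20, 0.70, 0.10],  # TOMORROW (repeat, merged)
    [0.90, 0.05, 0.05],  # blank separates two distinct signs
    [0.10, 0.60, 0.30],  # TOMORROW again (kept, because a blank intervened)
    [0.10, 0.20, 0.70],  # RAIN
    [0.10, 0.10, 0.80],  # RAIN (repeat, merged)
]
decoded = ctc_greedy_decode(frame_scores, vocab)
```

The blank symbol is what lets a many-frame video map to a short gloss sequence while still allowing the same gloss to occur twice in a row.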

Abstract

Continuous Sign Language Recognition (CSLR) datasets contain RGB videos annotated with similar sentence glosses representing sign sequences. However, the performance of CSLR systems is often hindered by redundant video frames, sparse text glosses, and the difficulty of aligning the visual and textual modalities. This research presents a unique model that generates enhanced compact visual features using Cross-Modal Attention (CMA) and Compact Bilinear Pooling (CBP) modules. The CMA module extracts Attended Visual Features (AVF) by optimizing a cross-modal loss derived from gloss textual vectors, aligning visual features with pre-formulated text sequences. To address the high dimensionality of the AVFs, we utilize the CBP module to combine them with gloss textual vectors, resulting in Textual-Guided Compact Visual Attention Features (TGCVAF) for classification. The predicted glosses are then hard-aligned into meaningful sentences using Connectionist Temporal Classification (CTC). The effectiveness of the proposed model is demonstrated through large-scale experiments on datasets that include RWTH-PHOENIX, Chinese Sign Language, and our newly developed Doordarshan Continuous Indian Sign Language (DDCISL) dataset, on which the model achieves improved performance over the latest techniques.
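
As a rough illustration of how compact bilinear pooling can fuse a visual vector with a text vector without forming their full outer product, the sketch below uses the standard Count Sketch construction: each vector is hashed into `d` buckets with random signs, and the two sketches are combined by circular convolution. It is a minimal pure-Python sketch under assumptions, not the paper's CBP module; all function names, the dimension `d`, the seeds, and the input values are hypothetical.

```python
import random


def make_sketch_params(n, d, seed):
    """Draw a random bucket h[i] in [0, d) and sign s[i] in {-1, +1} per input dim."""
    rng = random.Random(seed)
    h = [rng.randrange(d) for _ in range(n)]
    s = [rng.choice((-1, 1)) for _ in range(n)]
    return h, s


def count_sketch(x, h, s, d):
    """Project x to d dims: x[i] is added, with sign s[i], to bucket h[i]."""
    out = [0.0] * d
    for xi, hi, si in zip(x, h, s):
        out[hi] += si * xi
    return out


def circular_convolution(a, b):
    """Direct O(d^2) circular convolution of two equal-length vectors."""
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out


def compact_bilinear_pool(visual, text, d=16, seed=0):
    """Fuse two vectors into d dims (illustrative; not the paper's code)."""
    hv, sv = make_sketch_params(len(visual), d, seed)
    ht, st = make_sketch_params(len(text), d, seed + 1)
    return circular_convolution(count_sketch(visual, hv, sv, d),
                                count_sketch(text, ht, st, d))


# Hypothetical attended visual feature and gloss text vector.
visual = [0.5, -1.0, 2.0, 0.25]
text = [1.0, 3.0, -2.0]
fused = compact_bilinear_pool(visual, text, d=8, seed=0)
```

The result is exactly the Count Sketch of the outer product of the two inputs, so the fused vector stays at `d` dimensions however large the inputs are. Practical implementations usually compute the circular convolution with FFTs instead of the double loop used here.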
