Continuous Sign Language Recognition (CSLR) datasets contain RGB videos annotated with corresponding sentence glosses representing sign sequences. However, the performance of CSLR systems is often hindered by redundant video frames, sparse text glosses, and the difficulty of aligning the visual and textual modalities. This research presents a novel model that generates enhanced compact visual features using Cross-Modal Attention (CMA) and Compact Bilinear Pooling (CBP) modules. The CMA module extracts Attended Visual Features (AVF) by optimizing a cross-modal loss derived from gloss textual vectors, which aligns the visual features with pre-formulated text sequences. To address the high dimensionality of the AVFs, we use the CBP module to combine them with the gloss textual vectors, producing Textual-Guided Compact Visual Attention Features (TGCVAF) for classification. The predicted glosses are then hard aligned into meaningful sentences using Connectionist Temporal Classification (CTC). The effectiveness of the proposed model is demonstrated through extensive experiments on RWTH-PHOENIX, Chinese Sign Language, and our newly developed Doordarshan Continuous Indian Sign Language (DDCISL) dataset, on which the model achieves improved performance over recent techniques.
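To make the CBP step more concrete, the following is a minimal sketch of compact bilinear pooling in its common Tensor Sketch form (count sketch followed by an FFT-based circular convolution). It is an illustration under stated assumptions, not the implementation used in this work: the function names, the feature dimensions (16 and 8), and the output dimension (32) are hypothetical and chosen only for the example.

```python
import numpy as np

def make_sketch_params(input_dim, output_dim, rng):
    """Draw the random hash indices and signs used by one count sketch."""
    h = rng.integers(0, output_dim, size=input_dim)          # bucket per input index
    s = rng.choice(np.array([-1.0, 1.0]), size=input_dim)    # random sign per input index
    return h, s

def count_sketch(x, h, s, output_dim):
    """Project x into output_dim buckets by signed accumulation."""
    out = np.zeros(output_dim)
    np.add.at(out, h, s * x)  # unbuffered add, so colliding indices accumulate
    return out

def compact_bilinear_pool(visual, text, params_v, params_t, output_dim):
    """Approximate the outer product of two vectors in output_dim dimensions.

    The circular convolution of the two count sketches equals a count
    sketch of the full outer product, so the high-dimensional bilinear
    feature is never materialized.
    """
    sv = count_sketch(visual, *params_v, output_dim)
    st = count_sketch(text, *params_t, output_dim)
    # Circular convolution computed as an element-wise product in the Fourier domain.
    return np.fft.irfft(np.fft.rfft(sv) * np.fft.rfft(st), n=output_dim)

# Hypothetical sizes: a 16-d attended visual feature and an 8-d gloss text vector.
rng = np.random.default_rng(0)
d_visual, d_text, d_out = 16, 8, 32
params_v = make_sketch_params(d_visual, d_out, rng)
params_t = make_sketch_params(d_text, d_out, rng)

visual_feature = rng.standard_normal(d_visual)
text_vector = rng.standard_normal(d_text)
fused = compact_bilinear_pool(visual_feature, text_vector, params_v, params_t, d_out)
```

The hash parameters are drawn once and then held fixed, so that every frame is projected into the same compact space. In practice the output dimension would be much larger than in this toy example, while remaining far smaller than the product of the two input dimensions.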
Prathyusha et al. (Sun,) studied this question.