Pre-training has proven effective in boosting the performance of Isolated Sign Language Recognition (ISLR). Existing pre-training methods focus solely on compact pose data, which eliminates background perturbation but inevitably provides fewer semantic cues than raw RGB videos. Nevertheless, learning representations directly from RGB videos remains challenging due to the presence of sign-irrelevant visual features. To address this dilemma, we propose a Cross-modal Consistency Learning framework (CCL-SLR), which leverages cross-modal consistency between the RGB and pose modalities in a self-supervised paradigm. First, CCL-SLR employs contrastive learning for instance discrimination within and across modalities. Through single-modal and cross-modal contrastive learning, CCL-SLR gradually aligns the feature spaces of the RGB and pose modalities, thereby extracting consistent sign representations. Second, we introduce Motion-Preserving Masking (MPM) and Semantic Positive Mining (SPM) techniques to improve cross-modal consistency from the perspectives of data augmentation and sample clustering, respectively. Extensive experiments on four ISLR benchmarks show that CCL-SLR achieves impressive performance, demonstrating its effectiveness. The code is available at https://github.com/dueToLife/CCL-SLR.
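To make the idea of cross-modal instance discrimination concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss between RGB and pose embeddings of the same clips. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss form, so the InfoNCE formulation, the function names (`info_nce`, `cross_modal_loss`), and the temperature value are all hypothetical choices. The released repository is the reference for the actual method.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def info_nce(queries, keys, temperature=0.07):
    # Assumed loss form: queries[i] and keys[i] are the positive pair,
    # and every other key in the batch acts as a negative.
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature                        # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def cross_modal_loss(rgb_feats, pose_feats, temperature=0.07):
    # Symmetric term: RGB-to-pose and pose-to-RGB instance discrimination.
    return 0.5 * (info_nce(rgb_feats, pose_feats, temperature)
                  + info_nce(pose_feats, rgb_feats, temperature))

# Toy usage with random stand-ins for encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
rgb = rng.normal(size=(8, 16))
pose_aligned = rgb + 0.01 * rng.normal(size=(8, 16))   # nearly matching pairs
pose_random = rng.normal(size=(8, 16))                  # unrelated pairs
loss_aligned = cross_modal_loss(rgb, pose_aligned)
loss_random = cross_modal_loss(rgb, pose_random)
```

In this toy setup the loss for nearly matching pairs is lower than for unrelated pairs, which is the behavior that would push the two feature spaces toward alignment during training. A single-modal term could plausibly reuse `info_nce` with two augmented views of the same modality, though the abstract does not say how the terms are combined.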
Kepeng Wu
University of Science and Technology of China
Zecheng Li
University of Science and Technology of China
Weichao Zhao
University of Science and Technology of China
The University of Texas at Austin
University of Science and Technology of China
Wu et al. studied this question.
synapsesocial.com/papers/6a10b384cfa01e990d9f5a39 — DOI: https://doi.org/10.1109/cvprw67362.2025.00392