Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses the word error rate (WER), i.e., edit distance, as the main evaluation metric. Since WER is not differentiable, the learning model is usually optimized instead with the connectionist temporal classification (CTC) objective loss, which maximizes the posterior probability over the sequential alignment. Due to this optimization gap, the predicted sentence with the highest decoding probability may not be the best choice under the WER metric. To tackle this issue, we propose a novel architecture with cross-modality augmentation. Specifically, we first augment cross-modal data by simulating the calculation procedure of WER, i.e., substitution, deletion and insertion, on both the text label and its corresponding video. With these real and generated pseudo video-text pairs, we propose multiple loss terms to minimize the cross-modality distance between the video and the ground-truth label, and to make the network distinguish between real and pseudo modalities. The proposed framework can be easily extended to other existing CTC-based continuous SLR architectures. Extensive experiments on two continuous SLR benchmarks, i.e., RWTH-PHOENIX-Weather and CSL, validate the effectiveness of our proposed method.
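As a rough illustration of the metric and of the edit operations the abstract refers to, the sketch below computes WER as word-level edit distance divided by reference length, and applies one random substitution, deletion or insertion to a gloss sequence. It is not the authors' implementation: the names `wer` and `pseudo_label`, the toy vocabulary and the single-edit policy are assumptions made for illustration, and only the text side is shown (the paper applies the corresponding edit to the video as well).

```python
import random


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(m + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[n][m] / max(n, 1)


def pseudo_label(glosses, vocab, rng):
    """Hypothetical helper: apply one random edit to a gloss sequence.

    Assumes vocab holds at least two distinct words so that a
    substitution can always pick a different word.
    """
    edited = list(glosses)
    op = rng.choice(["substitute", "delete", "insert"])
    if op == "insert" or not edited:
        edited.insert(rng.randrange(len(edited) + 1), rng.choice(vocab))
    elif op == "delete":
        del edited[rng.randrange(len(edited))]
    else:
        pos = rng.randrange(len(edited))
        edited[pos] = rng.choice([w for w in vocab if w != edited[pos]])
    return edited


if __name__ == "__main__":
    vocab = ["RAIN", "SUN", "CLOUD", "WIND", "NORTH", "TOMORROW"]
    real = ["TOMORROW", "NORTH", "RAIN", "WIND"]
    fake = pseudo_label(real, vocab, random.Random(0))
    print(fake, wer(real, fake))
```

Because each pseudo label is exactly one edit away from the real one, its WER against the real label is 1 divided by the reference length, which makes it a controlled "near miss" under the evaluation metric.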
Junfu Pu
University of Science and Technology of China
Wengang Zhou
University of Science and Technology of China
Hezhen Hu
The University of Texas at Austin
University of Science and Technology of China
Pu et al. (Mon,) studied this question.
synapsesocial.com/papers/6a10b385cfa01e990d9f5a49 — DOI: https://doi.org/10.1145/3394171.3413931