This paper presents IP-SLT, a simple yet effective framework for sign language translation (SLT). IP-SLT adopts a recurrent structure and enhances the semantic representation (prototype) of the input sign language video through iterative refinement. The idea mimics human reading, where a sentence can be digested repeatedly until it is accurately understood. Technically, IP-SLT consists of three modules: feature extraction, prototype initialization, and iterative prototype refinement. The initialization module generates the initial prototype from the visual features produced by the feature extraction module. The iterative refinement module then uses cross-attention to polish the previous prototype by aggregating it with the original video features. Through repeated refinement, the prototype converges to a more stable and accurate state, leading to a fluent and appropriate translation. In addition, to leverage the sequential dependence between prototypes, we propose an iterative distillation loss that compresses the knowledge of the final iteration into the earlier ones. Because the autoregressive decoding process is executed only once at inference, IP-SLT can be added to various SLT systems with acceptable overhead. Extensive experiments on public benchmarks demonstrate the effectiveness of IP-SLT.
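The refinement loop described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give the initialization, the update rule, or the form of the distillation loss, so everything here is an assumption made for illustration. The prototype is initialized as a copy of the video features, the update is an equal-weight residual mix of the previous prototype and its cross-attended features, and the distillation term is a mean squared distance between each earlier prototype and the final one. All function names (`cross_attention`, `refine`, `iterate_prototypes`, `distillation_loss`) are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, feats):
    # Each query (prototype vector) attends over all video feature
    # vectors, which serve as both keys and values (assumption).
    dim = len(feats[0])
    scale = math.sqrt(dim)
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in feats])
        out.append([sum(w * v[d] for w, v in zip(weights, feats))
                    for d in range(dim)])
    return out

def refine(prototype, feats):
    # One refinement step: mix the previous prototype with the features
    # it attends to. The 0.5/0.5 residual mix is an assumption.
    attended = cross_attention(prototype, feats)
    return [[0.5 * p + 0.5 * a for p, a in zip(pv, av)]
            for pv, av in zip(prototype, attended)]

def iterate_prototypes(feats, num_iters=3):
    # Stand-in initialization: the prototype starts as a copy of the
    # video features (assumption; the paper uses a dedicated module).
    prototype = [list(f) for f in feats]
    history = [prototype]
    for _ in range(num_iters):
        prototype = refine(prototype, feats)
        history.append(prototype)
    return history

def distillation_loss(history):
    # Pull every earlier prototype toward the final one, treated as the
    # teacher. Mean squared distance is an illustrative choice only.
    final = history[-1]
    losses = []
    for proto in history[:-1]:
        sq = [(a - b) ** 2
              for pv, fv in zip(proto, final)
              for a, b in zip(pv, fv)]
        losses.append(sum(sq) / len(sq))
    return sum(losses) / len(losses)

# Toy "video features": three frames, two dimensions each.
video_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
history = iterate_prototypes(video_feats, num_iters=3)
loss = distillation_loss(history)
```

In this sketch only the final entry of `history` would be passed to a decoder, which mirrors the abstract's point that autoregressive decoding runs once at inference while refinement itself is repeated.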
Huijie Yao
University of Science and Technology of China
Wengang Zhou
University of Science and Technology of China
Hao Feng
Jingdezhen Ceramic Institute
University of Science and Technology of China
Institute of Art
National Science Centre
Yao et al. (Sun,) studied this question.
synapsesocial.com/papers/6a1c1cce00ee29383e9d77e6 — DOI: https://doi.org/10.1109/iccv51070.2023.01429