Despite rapid progress in Sign Language Translation (SLT), it remains unclear how input modality and sequence length affect translation quality and efficiency. We conduct a controlled comparison of three commonly used input types (raw video, pose keypoints, and pretrained features) under a shared encoder–decoder architecture and a standardized training setup. Domain-adapted features perform best overall, while raw video outperforms zero-shot features and poses when domain adaptation is unavailable. Uniformly downsampling input sequences across modalities yields substantial gains in training speed and memory efficiency with no degradation in translation quality. SLT systems can therefore operate with significantly fewer input tokens, which enables faster experimentation, lower compute requirements, and broader accessibility. Moreover, all models maintain competitive performance under downsampling, highlighting the viability of fully end-to-end SLT pipelines that do not rely on intermediate representations. We release all code, trained models, and preprocessing scripts at: https://github.com/GerrySant/multimodalhugs/tree/modality_matters-sltat2025
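As an illustration of the uniform downsampling mentioned in the abstract, the following is a minimal sketch that assumes it means keeping every k-th time step of an input sequence. The abstract does not specify the exact procedure, so the function name `uniform_downsample`, the `factor` parameter, and the example tensor shape are hypothetical and not taken from the released code.

```python
import numpy as np


def uniform_downsample(frames: np.ndarray, factor: int) -> np.ndarray:
    """Keep every `factor`-th step along the time axis (axis 0).

    Assumption: "uniform downsampling" is read here as fixed-stride
    subsampling; the paper's actual procedure may differ.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return frames[::factor]


# Hypothetical pose sequence: 100 frames, 50 keypoints, (x, y) coordinates.
poses = np.zeros((100, 50, 2), dtype=np.float32)
short = uniform_downsample(poses, factor=4)
print(poses.shape, "->", short.shape)  # (100, 50, 2) -> (25, 50, 2)
```

Because the stride is applied only along the time axis, the same function would work for raw video frames, pose keypoints, or per-frame feature vectors, which is what makes a comparison across modalities straightforward under this assumption.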
Gerard Sant
University of Zurich
Amit Moryossef
University of Zurich
Mathias Müller
Sant et al. studied this question.
synapsesocial.com/papers/69ada885bc08abd80d5bb959 — DOI: https://doi.org/10.5167/uzh-292809