What question did this study set out to answer?

The research aims to develop a novel framework for translating sign language directly into written language without relying on gloss annotations.

February 21, 2026 (Open Access)

Handscribe: A gloss-free framework for sign language translation and gloss sequence generation

Key Points

The research aims to develop a novel framework for translating sign language directly into written language without relying on gloss annotations.
Introduced a two-stage process built on SlowFast-based spatiotemporal features and a frozen mBART model.
In the first stage, translated continuous sign language videos into written sentences.
In the second stage, generated gloss sequences from those sentences using a fine-tuned Large Language Model (LLaMa3.1-8B-Instruct).
Fine-tuned the Large Language Model with weak supervision, eliminating the need for gloss-level supervision during training.
Demonstrated strong translation performance on both the PHOENIX-2014-T and Wav2Gloss benchmarks.
Achieved state-of-the-art multilingual gloss generation, even in zero-shot conditions.
Reduced the need for time-consuming manual gloss annotation.

Abstract

Sign language translation systems traditionally rely on intermediate gloss representations to bridge the gap between visual input and written language output. However, manual gloss annotation is costly, language-dependent, and often lossy, prompting growing interest in gloss-free alternatives. This paper introduces Handscribe, a novel two-stage framework for gloss-free sign language translation and gloss sequence generation. Handscribe first translates continuous sign language videos into written language sentences using a lightweight decoder built atop SlowFast-based spatiotemporal features and a frozen mBART model. Then, in the second stage, it generates gloss sequences from these sentences using a Large Language Model (LLaMa3.1-8B-Instruct) that has been fine-tuned with weak supervision. Our experiments on PHOENIX-2014-T and Wav2Gloss Fieldwork demonstrate strong translation performance and state-of-the-art multilingual gloss generation, even in zero-shot settings. The proposed framework reduces annotation bottlenecks while maintaining flexibility and interpretability, paving the way for scalable and inclusive sign language technologies. The code and fine-tuning scripts are available at https://github.com/colonnaemanuele/Handscribe.

• We propose a gloss-free framework for sign language translation and gloss sequence generation.
• Our method leverages SlowFast features and a frozen mBART decoder.
• Glosses are inferred post-translation using a fine-tuned Large Language Model.
• The approach eliminates the need for gloss-level supervision during training.
• We report strong results on PHOENIX-2014-T and Wav2Gloss benchmarks.
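
The two-stage data flow described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the Handscribe repository: the names `run_two_stage`, `TwoStageResult`, `toy_translate`, and `toy_gloss` are hypothetical, and the toy functions stand in for the SlowFast/mBART translator and the fine-tuned LLM so that the example runs without model weights. The only point it shows is that glosses are derived from the translated sentence, not from the video.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical type aliases for illustration only.
Frame = Sequence[float]                      # stand-in for per-frame video features
Translator = Callable[[List[Frame]], str]    # stage 1: video -> written sentence
Glosser = Callable[[str], List[str]]         # stage 2: sentence -> gloss sequence


@dataclass
class TwoStageResult:
    sentence: str
    glosses: List[str]


def run_two_stage(frames: List[Frame], translate: Translator, gloss: Glosser) -> TwoStageResult:
    """Translate the video first, then infer glosses from the text alone."""
    sentence = translate(frames)   # stage 1 needs no gloss supervision
    glosses = gloss(sentence)      # stage 2 never sees the video
    return TwoStageResult(sentence=sentence, glosses=glosses)


# Toy stand-ins (assumptions): a real system would call the trained models here.
def toy_translate(frames: List[Frame]) -> str:
    return "tomorrow rain in the north"


def toy_gloss(sentence: str) -> List[str]:
    # Crude placeholder: uppercase content words, drop two function words.
    return [w.upper() for w in sentence.split() if w not in {"in", "the"}]


result = run_two_stage([[0.0]], toy_translate, toy_gloss)
print(result.sentence)
print(result.glosses)
```

Because each stage is passed in as a plain callable, either one can be swapped out independently, which mirrors the separation between translation and gloss generation that the abstract describes.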

Authors

Emanuele Colonna

Ivan Rinaldi

David Landi

Journals

Computer Vision and Image Understanding

Institutions

University of Bari Aldo Moro

University of Siena
