What type of study is this?

This is a quantitative study.

October 2, 2025 (Open Access)

Direct Simultaneous Translation Activation for Large Audio-Language Models

Key Points

Adding simultaneous data amounting to only about 1% of the full offline SFT data activates Simul-S2TT capabilities in large audio-language models.
Incorporating simultaneous data into offline SFT data bridges the distribution gap between offline translation during pretraining and simultaneous translation during inference.
The SimulSA method enables real-time translation without modifying the model architecture or decoding strategy.
Experimental results indicate that this minimal augmentation is enough to significantly activate simultaneous translation capability.

Abstract

Simultaneous speech-to-text translation (Simul-S2TT) aims to translate speech into target text in real time, outputting translations while receiving source speech input, rather than waiting for the entire utterance to be spoken. Simul-S2TT research often modifies model architectures to implement read-write strategies. However, with the rise of large audio-language models (LALMs), a key challenge is how to directly activate Simul-S2TT capabilities in base models without additional architectural changes. In this paper, we introduce Simultaneous Self-Augmentation (SimulSA), a strategy that utilizes LALMs' inherent capabilities to obtain simultaneous data by randomly truncating speech and constructing partially aligned translations. By incorporating these samples into offline SFT data, SimulSA effectively bridges the distribution gap between offline translation during pretraining and simultaneous translation during inference. Experimental results demonstrate that augmenting only about 1% of the simultaneous data, compared to the full offline SFT data, can significantly activate LALMs' Simul-S2TT capabilities without modifications to model architecture or decoding strategy.
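
The data construction described in the abstract (randomly truncate speech, pair it with a partially aligned translation, and mix a small share into the offline SFT set) can be illustrated with a rough sketch. This is not the authors' implementation. The names `Sample`, `truncate_sample`, and `build_sft_mix` are hypothetical, and the proportional-length translation prefix is a simplifying assumption standing in for the paper's self-augmentation step, in which the LALM itself helps produce the partial translation.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    audio: List[float]      # waveform samples (hypothetical representation)
    translation: List[str]  # target-side tokens


def truncate_sample(sample: Sample, rng: random.Random,
                    min_ratio: float = 0.2, max_ratio: float = 0.9) -> Sample:
    """Cut the audio at a random point and keep a translation prefix.

    ASSUMPTION: the partial translation is approximated by a prefix whose
    length is proportional to the retained audio. The paper instead relies
    on the LALM's own capabilities to build the partial target.
    """
    ratio = rng.uniform(min_ratio, max_ratio)
    n_audio = max(1, int(len(sample.audio) * ratio))
    n_text = int(len(sample.translation) * ratio)
    return Sample(sample.audio[:n_audio], sample.translation[:n_text])


def build_sft_mix(offline: Sequence[Sample], aug_fraction: float = 0.01,
                  seed: int = 0) -> List[Sample]:
    """Append truncated samples equal to about aug_fraction of the offline set."""
    rng = random.Random(seed)
    n_aug = max(1, round(len(offline) * aug_fraction))
    picked = rng.sample(list(offline), n_aug)
    simul = [truncate_sample(s, rng) for s in picked]
    mixed = list(offline) + simul
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy offline set: 200 utterances, 100 audio samples and 10 tokens each.
    offline = [Sample([0.0] * 100, [f"tok{i}_{j}" for j in range(10)])
               for i in range(200)]
    mixed = build_sft_mix(offline)
    print(len(offline), len(mixed))
```

With the default `aug_fraction` of 0.01, a toy offline set of 200 utterances gains 2 truncated samples, which mirrors the roughly 1% ratio reported in the abstract. The truncation range of 0.2 to 0.9 is an arbitrary choice for illustration.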
