September 1, 2024 · Open Access

ZeroST: Zero-Shot Speech Translation

Sameer Khurana (Mitsubishi Electric, United States), Chiori Hori (Mitsubishi Electric, United States), Antoine Laurent (Le Mans Université)

Abstract

Our work introduces the Zero-Shot Speech Translation (ZeroST) framework, which combines pretrained multilingual speech and text foundation models. Inspired by recent advances in multimodal foundation models, ZeroST uses a Query Transformer (Q-Former) to connect a speech foundation model, such as Whisper or Massively Multilingual Speech (MMS), with a text translation model such as No-Language-Left-Behind (NLLB). Our proposed learning framework enables the model to perform speech-to-text translation in a zero-shot manner, without explicit supervision from expensive-to-collect speech-text translation pairs during training. Our extensive experiments, notably on the Europarl-ST benchmark, demonstrate that ZeroST achieves results comparable to those of a strong cascaded translation system and significantly outperforms baseline models. This approach paves the way for future research in zero-shot speech translation.
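
To illustrate the bridging idea, here is a minimal sketch of the cross-attention step at the core of a Q-Former-style module: a small, fixed set of query vectors attends over a variable-length sequence of speech-encoder frames and returns a fixed-length summary that a text model could consume. This is not the authors' implementation. The sizes, the random stand-in features, and the helper names (`cross_attend`, `random_matrix`) are assumptions made for illustration, and the sketch omits the learned projections, multiple heads, self-attention, and feed-forward layers that a real Q-Former would likely include.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention: each query pools over all frames.

    queries:  list of K vectors (stand-ins for learned query embeddings)
    features: list of T vectors (stand-ins for speech-encoder outputs)
    returns:  list of K vectors, independent of T
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [
            scale * sum(q * f for q, f in zip(query, frame))
            for frame in features
        ]
        weights = softmax(scores)
        # Weighted average of the frames, one output vector per query.
        pooled = [
            sum(w * frame[j] for w, frame in zip(weights, features))
            for j in range(dim)
        ]
        outputs.append(pooled)
    return outputs


def random_matrix(rows, cols, rng):
    """Hypothetical helper: Gaussian noise standing in for real features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
NUM_QUERIES, DIM = 4, 8  # illustrative sizes, not taken from the paper

queries = random_matrix(NUM_QUERIES, DIM, rng)
short_utterance = random_matrix(50, DIM, rng)   # 50 speech frames
long_utterance = random_matrix(300, DIM, rng)   # 300 speech frames

short_out = cross_attend(queries, short_utterance)
long_out = cross_attend(queries, long_utterance)

# Both utterances are reduced to the same number of output vectors.
print(len(short_out), len(long_out))
```

The point of the sketch is the shape behavior: the number of output vectors is set by the number of queries rather than by the utterance length, which is what lets such a module sit between a speech encoder and a text translation model.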
