Key points are not available for this paper at this time.
Our work introduces the Zero-Shot Speech Translation (ZeroST) framework, leveraging the synergistic potential of pre trained multilingual speech and text foundation models. Inspired by recent advances in multimodal foundation models, ZeroST utilizes a Query Transformer (Q-Former) to seamlessly connect a speech foundation model, such as Whisper or Massively Multilingual Speech (MMS), with a text translation model like No-Language-Left-Behind (NLLB). Our proposed learning framework enables the model to perform the speech-to-text translation in a zero-shot manner, bypassing the need for explicit supervision from expensive-to-collect speech-text translation pairs during training. Our extensive experiments, notably on the Europarl-ST benchmark, demonstrate that ZeroST achieves results comparable to those of a strong cascaded translation system and significantly outperforms baseline models. This promising approach paves the way for future research in zero-shot speech translation.
Building similarity graph...
Analyzing shared references across papers
Loading...
Khurana et al. (Sun,) studied this question.
www.synapsesocial.com/papers/68e59e92b6db643587538ab3 — DOI: https://doi.org/10.21437/interspeech.2024-1088
Sameer Khurana
Chiori Hori
Antoine Laurent
Building similarity graph...
Analyzing shared references across papers
Loading...
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: