Key points are not available for this paper at this time.
Speech and language models are advancing towards universality.A single model can now handle translations across 200 languages and transcriptions for over 100 languages.Universal models simplify development, deployment, and importantly, transfer knowledge to less-resourced languages or modes.This paper introduces M2BART, a streamlined multilingual and multimodal framework for encoderdecoder models.It employs a self-supervised speech tokenizer, bridging speech and text, and is pre-trained with a unified objective for both unimodal and multimodal, unsupervised and supervised data.When tested on Spanish-to-English and English-to-Hokkien translations, M2BART consistently surpassed competitors.We also showcase an innovative translation model enabling zero-shot transfers even without labeled data.
Chen et al. (Mon,) studied this question.