June 4, 2024Open Access

Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing

Key Points

Key points are not available for this paper at this time.

Abstract

Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text to speech, speech to speech translation. Moreover, large language models (LLMs) pretrained from vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multi-modal models, including the loss function, weight initialization, mixed training supervision, and codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM, and from a codebook derived from Whisper activations.

Read Full Paperexternally

KI fragen

Bookmark

View Full Paper

Cite This Study

Trinh et al. (Tue,) studied this question.

synapsesocial.com/papers/68e664a3b6db6435875f0c1a https://doi.org/https://doi.org/10.48550/arxiv.2406.06582

Also Consider

Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context:

KI fragen

Bookmark

View Full Paper