What question did this study set out to answer?

The aim is to provide an in-depth overview of audio language models that process speech and text.

May 14, 2026

A comprehensive overview of audio language models

Ryota Komatsu, Tokyo Institute of Technology

Key Points

Provided an in-depth overview of audio language models that process speech and text.
Categorized models into speech-to-speech systems and text-based models with audio encoders.
Described core components like audio representation and token generation strategies.
Discussed training methodologies including curriculum learning and speech-text alignment.
Demonstrated the effectiveness of using both phonetic and acoustic tokens for better audio understanding.
Explored various sequence modeling techniques that enhance speech and text token generation.
Introduced guidelines for datasets and benchmarks to evaluate audio language models.

Abstract

Recent advances have extended large language models (LLMs) beyond text to process diverse audio inputs such as speech, environmental sounds, and music. This lecture offers a comprehensive overview of state-of-the-art audio language models capable of jointly understanding or generating audio and text. We first categorize models into two types: speech-to-speech dialogue systems, and instruction-following text-based LLMs equipped with audio encoders. We then describe three core components: audio representation, token generation strategies, and learning paradigms. In particular, we discuss the trade-offs between phonetic and acoustic tokens, showing how their complementary use enables unified understanding of general audio. We then explore sequence modeling techniques for joint speech and text token generation—including hierarchical, interleaved, or parallel generation—that support intelligible, low-latency, and full-duplex interactions. The lecture also covers curriculum learning-based training methodologies, including speech-text alignment pretraining, instruction tuning, and policy optimization. Finally, we introduce datasets and benchmarks for building and evaluating audio-enabled LLMs. This tutorial aims to equip researchers and practitioners with a structured understanding of this rapidly evolving field and to inspire further exploration into audio language modeling. Work supported by JTEKT Corporation.


Cite This Study

Ryota Komatsu studied this question.

synapsesocial.com/papers/6a056714a550a87e60a1f0ee
https://doi.org/10.1121/10.0040818
