Recent advances have extended large language models (LLMs) beyond text to process diverse audio inputs such as speech, environmental sounds, and music. This lecture offers a comprehensive overview of state-of-the-art audio language models capable of jointly understanding or generating audio and text. We first categorize models into two types: speech-to-speech dialogue systems, and instruction-following text-based LLMs equipped with audio encoders. We then describe three core components: audio representation, token generation strategies, and learning paradigms. In particular, we discuss the trade-offs between phonetic and acoustic tokens, showing how their complementary use enables unified understanding of general audio. We then explore sequence modeling techniques for joint speech and text token generation (hierarchical, interleaved, or parallel generation) that support intelligible, low-latency, and full-duplex interactions. The lecture also covers curriculum-learning-based training methodologies, including speech-text alignment pretraining, instruction tuning, and policy optimization. Finally, we introduce datasets and benchmarks for building and evaluating audio-enabled LLMs. This tutorial aims to equip researchers and practitioners with a structured understanding of this rapidly evolving field and to inspire further exploration into audio language modeling. Work supported by JTEKT Corporation.
Ryota Komatsu
Tokyo Institute of Technology
The Journal of the Acoustical Society of America
synapsesocial.com/papers/6a056714a550a87e60a1f0ee — DOI: https://doi.org/10.1121/10.0040818