September 1, 2024 · Open Access

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Abstract

Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language. Recent advances in large language models (LLMs), together with improved training approaches for audio encoders, have opened up possibilities for improving AAC. We therefore explore enhancing AAC in three respects: 1) an audio encoder pre-trained via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing the acoustic tokens; 2) we investigate the advantages of using LLAMA 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and the text decoder are optimized with low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a SPIDEr-FL score of 33.0, outperforming the winner of DCASE 2023 Task 6A.
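
To illustrate the low-rank adaptation mentioned in the abstract, the following is a minimal sketch of the general LoRA idea, not the authors' implementation. A frozen weight W is augmented with a trainable low-rank product, giving an effective weight W + (alpha / r) * B A. The rank `r`, the scale `alpha`, the layer sizes, and the helper names `matmul`, `lora_weight`, and `lora_param_counts` are all hypothetical; the abstract does not state the values used.

```python
# Illustrative LoRA sketch (not the authors' code). The rank r, scale alpha,
# and layer sizes below are assumptions; the abstract does not state them.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_weight(w, b, a, alpha, r):
    """Return the effective weight W + (alpha / r) * (B @ A)."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full fine-tuning vs. LoRA adapters only."""
    return d_out * d_in, r * (d_out + d_in)

# Frozen 2x2 weight with a rank-1 adapter (toy values).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape d_out x r
A = [[0.5, 0.5]]     # shape r x d_in
W_eff = lora_weight(W, B, A, alpha=2.0, r=1)

# Hypothetical 4096 x 4096 projection adapted at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
```

Under these assumed sizes, the adapter trains 65,536 parameters in place of the 16,777,216 in the full matrix, which is why LoRA makes it practical to adapt both an audio encoder and a 7B-parameter decoder.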

Authors

Jizhong Liu

PLA Academy of Military Science

Gang Li

Fudan University Shanghai Cancer Center

Junbo Zhang

Beijing Jiaotong University
