September 1, 2024

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Abstract

Audio-LLM introduces the audio modality into a large language model (LLM), enabling a powerful LLM to recognize, understand, and generate audio. However, during speech recognition in noisy environments, we observed illusions (hallucinations) and repetition issues in audio-LLM, which lead to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM that introduces an ASR expert as a transcription tokenizer, together with a hybrid autoregressive (AR) and non-autoregressive (NAR) decoding approach, to solve these problems. Experiments on the 10k-hour WenetSpeech Mandarin corpus show that our approach achieves relative CER reductions of 12.2% and 9.6% on the TestNet and TestMeeting evaluation sets, respectively, compared with the baseline. Notably, we reduce the decoding repetition rate on the evaluation set to zero, showing that the decoding repetition problem has been solved fundamentally.
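For readers unfamiliar with the metric, the following is a minimal sketch of how character error rate (CER) and a relative reduction are commonly computed. It is not the authors' evaluation code: the function names are hypothetical, and the example sentence and the numeric CER values in the usage lines are invented purely for illustration, since the abstract does not report absolute CERs.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, proposed):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - proposed) / baseline


if __name__ == "__main__":
    # Hypothetical example: a repeated character counts as one insertion error.
    print(cer("今天天气很好", "今天天天气很好"))
    # Hypothetical absolute CERs, chosen only to illustrate the arithmetic.
    print(relative_reduction(10.0, 8.78))
```

Under this definition, a repeated token in the hypothesis shows up as an insertion error, which is why decoding repetition inflates CER directly.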

Authors

Yangze Li

Northwestern Polytechnical University

Xiong Wang

The University of Sydney

Songjun Cao

Tencent (China)
