Recent studies have shown that large language models (LLMs) are effective for error correction in automatic speech recognition (ASR). However, existing LLM-based error correction approaches use only textual information and neglect pitch accent information, which leads to over-correction. Japanese has words, such as “Hashi (Chopsticks)” and “Hashi (Bridge),” that can be distinguished by their pitch accent, so pitch accent information is important for error correction in Japanese ASR. In this paper, we investigate the use of a pre-trained LLM to improve the outputs of Japanese ASR. In particular, we aim to improve error correction by using the N-best hypotheses and the pitch accent information generated by ASR as input to the LLM. We fine-tune the LLM and design an input prompt that combines the N-best hypotheses with the corresponding pitch accent information generated by Whisper. Through this evaluation, we aim to clarify the effect of using pitch accent information on ASR error correction in Japanese.
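As an illustration of how N-best hypotheses and pitch accent information might be combined into a single input prompt, the following is a minimal Python sketch. The text above does not specify the prompt template, so everything here is an assumption: the `Hypothesis` class, the `build_prompt` function, the instruction wording, and the per-mora "H"/"L" accent notation are hypothetical and are not the format used in this work.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str    # one ASR hypothesis from the N-best list
    accent: str  # assumed pitch accent annotation, one H/L label per mora


def build_prompt(hypotheses: list[Hypothesis]) -> str:
    """Interleave each hypothesis with its accent label in rank order.

    The template below is hypothetical; it only shows one way to pair
    each hypothesis with its accent information in a text prompt.
    """
    lines = [
        "Correct the speech recognition result using the N-best "
        "hypotheses and their pitch accents."
    ]
    for rank, hyp in enumerate(hypotheses, start=1):
        lines.append(f"Hypothesis {rank}: {hyp.text}")
        lines.append(f"Pitch accent {rank}: {hyp.accent}")
    lines.append("Corrected transcription:")
    return "\n".join(lines)


# Illustrative 2-best list for the "hashi" example (romanized, made-up data).
example_prompt = build_prompt([
    Hypothesis(text="hashi (chopsticks)", accent="H L"),
    Hypothesis(text="hashi (bridge)", accent="L H"),
])
```

In this sketch each hypothesis is followed directly by its accent label, so the model sees the two pieces of information side by side. Other layouts, such as listing all hypotheses first and all accent labels afterwards, would be equally plausible.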
Suzuki et al. (Wed,) studied this question.