Post-processing text transcriptions using a bag of hallucinations (BoH) reduced word error rate (WER) and safeguarded against problematic hallucinations induced by non-speech audio.
Post-processing text transcriptions using a bag of hallucinations can reduce word error rate and mitigate hallucinations in the Whisper ASR model.
Hallucinations of deep neural models are among the key challenges in automatic speech recognition (ASR). In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. By inducing hallucinations with various types of sounds, we show that there exists a set of hallucinations that appear frequently. We then study hallucinations caused by the augmentation of speech with such sounds. Finally, we describe the creation of a bag of hallucinations (BoH) that makes it possible to remove the effect of hallucinations through the post-processing of text transcriptions. The results of our experiments show that such post-processing is capable of reducing word error rate (WER) and acts as a good safeguard against problematic hallucinations.
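The post-processing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag entries, the normalization rule, and the function names `normalize`, `remove_hallucinations`, and `wer` are all assumptions made for illustration. The sketch removes known hallucinated phrases from a transcript by exact phrase matching and then measures WER as word-level edit distance divided by the number of reference words.

```python
import re

# Hypothetical bag of hallucinations: illustrative entries only,
# not the list built in the paper.
BAG_OF_HALLUCINATIONS = [
    "thanks for watching",
    "please subscribe to my channel",
]


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return " ".join(text.split())


def remove_hallucinations(transcript: str, bag: list[str]) -> str:
    """Delete every bag phrase that occurs in the normalized transcript."""
    text = normalize(transcript)
    # Longest phrases first, so a short entry cannot split a longer one.
    for phrase in sorted((normalize(p) for p in bag), key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


reference = "turn on the kitchen light"
raw = "Turn on the kitchen light. Thanks for watching!"
cleaned = remove_hallucinations(raw, BAG_OF_HALLUCINATIONS)

print(wer(reference, raw))      # 0.6 (three inserted words over five reference words)
print(wer(reference, cleaned))  # 0.0
```

Exact phrase matching is the simplest possible design and would also delete a bag phrase that a speaker genuinely said, so a practical system would likely need additional checks before removal.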
Barański et al. studied hallucinations in the Whisper ASR model. Bag of hallucinations (BoH) post-processing was evaluated on word error rate (WER).