In this paper we address the problem of aligning very long (often more than one hour) audio files to their corresponding textual transcripts in an effective manner. We present an efficient recursive technique to solve this problem that works well even on noisy speech signals. The key idea of this algorithm is to turn the forced alignment problem into a recursive speech recognition problem with a gradually restricting dictionary and language model. The algorithm is tolerant to acoustic noise and to errors or gaps in the text transcript or audio tracks. We report experimental results on a 3-hour audio file containing TV and radio broadcasts. We show accurate alignments on speech under a variety of real acoustic conditions, such as speech over music and speech over telephone lines. We also report results when the same audio stream has been corrupted with additive white noise or compressed using a popular web encoding format such as RealAudio.

This algorithm has been used in our internal multimedia indexing project. It has processed more than 200 hours of audio from varied sources, such as WGBH NOVA documentaries and NPR web audio files. The system aligns speech media content in about one to five times real time, depending on the acoustic conditions of the audio signal.
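The recursive idea stated in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The `recognize` callback, the `min_anchor` and `max_depth` parameters, and the use of runs of consecutively matching words as trusted alignment points are all assumptions introduced here for illustration. The sketch runs a recognizer over an audio span using only the words of the matching transcript span, keeps the stretches where hypothesis and transcript agree, and recurses on the gaps in between with a correspondingly smaller vocabulary.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# (word, start_sec, end_sec), as produced by a hypothetical recognizer.
TimedWord = Tuple[str, float, float]
# Hypothetical recognizer interface: (t0, t1, allowed_words) -> timed hypothesis.
Recognizer = Callable[[float, float, List[str]], List[TimedWord]]


def align(words: List[str], t0: float, t1: float, recognize: Recognizer,
          offset: int = 0, min_anchor: int = 3,
          max_depth: int = 5) -> List[Tuple[int, float, float]]:
    """Return (transcript_index, start_sec, end_sec) for each aligned word."""
    if not words or t1 <= t0 or max_depth == 0:
        return []

    # The recognizer only sees the words of this transcript span, so the
    # dictionary shrinks at every level of the recursion.
    hyp = recognize(t0, t1, words)
    hyp_words = [w for w, _, _ in hyp]

    # Assumption: a run of consecutive matching words is trusted as correct.
    need = min(min_anchor, len(words))
    matcher = SequenceMatcher(a=words, b=hyp_words, autojunk=False)
    anchors = [m for m in matcher.get_matching_blocks() if m.size >= need]
    if not anchors:
        return []  # nothing trustworthy in this span; leave it unaligned

    aligned: List[Tuple[int, float, float]] = []
    prev_word, prev_time = 0, t0
    for m in anchors:
        # Recurse on the unaligned gap that precedes this trusted run.
        aligned += align(words[prev_word:m.a], prev_time, hyp[m.b][1],
                         recognize, offset + prev_word, min_anchor,
                         max_depth - 1)
        # Copy the recognizer's timings for the trusted run itself.
        for k in range(m.size):
            _, start, end = hyp[m.b + k]
            aligned.append((offset + m.a + k, start, end))
        prev_word = m.a + m.size
        prev_time = hyp[m.b + m.size - 1][2]

    # Recurse on whatever remains after the last trusted run.
    aligned += align(words[prev_word:], prev_time, t1, recognize,
                     offset + prev_word, min_anchor, max_depth - 1)
    return aligned
```

Spans with no trusted run are simply left unaligned in this sketch, which is one simple way to tolerate gaps in the transcript or the audio. A practical system would also need a policy for relaxing the matching criterion in later passes.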
Moreno et al. studied this question.