Processing long-form audio data is challenging, as recordings can contain different types of speech (adult, child, and infant speech), background noise, and silence. Recording and processing software such as LENA (Gilkerson & Richards, 2020) supports the processing of such data. However, the tool is closed-source and offers a limited number of processing tasks. LENA provides labels for speaker diarization (who speaks and when), but it does not transcribe the audio into text. Transcription is an important step for various NLP tasks, such as morpho-syntactic and sentiment analysis. This tutorial attempts to bridge this gap by presenting preliminary experiments with open-source NLP and AI tools for audio processing, such as Whisper and WhisperX. We explore the challenges of applying these tools to Korean speech data and present first results.
Ioana Buhnila studied this question.