Abstract
E-learning has transformed the educational landscape, particularly in higher education, by offering flexible, scalable, and often more accessible learning environments. Lecture recordings have become a widely used resource because they let students revisit class content at their own pace. However, the sheer volume and length of these recordings make it difficult for learners to locate specific types of instructional content efficiently. This paper presents a hierarchical multimodal approach that segments and classifies lecture recordings according to the teaching activity taking place. The proposed method integrates audio processing techniques and natural language understanding models to distinguish between communicative functions such as content delivery, task explanation, and organizational announcements. By leveraging both acoustic and textual cues, the system enables more effective navigation through educational videos and targeted access to relevant material. Experimental results demonstrate high overall accuracy and notable improvements over existing approaches, especially in identifying structured instructional discourse. Nonetheless, challenges persist in detecting informal or less clearly defined interactions. This work aims to enhance the usability of recorded lectures and support more personalized and efficient learning experiences.
Sapena et al. studied this question.