In e-learning, accurately recognizing learners’ emotions is a crucial prerequisite for improving learning outcomes and teaching quality. Most existing emotion recognition studies identify learners’ emotions by integrating physiological signals and facial expressions, but they often overlook how the different semantic levels embedded in instructional videos affect learners’ emotions. We therefore propose an Emotion Recognition Model based on Multimodal Video Semantic Hierarchy. The model constructs hierarchical video semantics and progressively integrates them through hierarchical stacking. The fused semantic representation is then combined with learners’ eye-movement physiological signals to improve emotion recognition performance. Experimental results on three public multimodal physiological datasets, VLMED, HCI-Tagging, and DEAP, confirm the model’s effectiveness in emotion recognition tasks.
Li et al. (Wed,) studied this question.