The increasing adoption of online and blended learning by students has generated a large volume of educational data. However, most existing learning analytics systems rely on a single data modality, primarily text logs, and therefore fail to capture the full range of student interactions. This study proposes a multimodal artificial intelligence framework that integrates visual, auditory, and textual data to enhance the analysis of learning behaviour in real-world educational settings. The proposed approach employs modality-specific deep learning encoders combined with an attention-based fusion mechanism to model complex interactions across heterogeneous data sources. A comprehensive dataset was collected from hybrid learning environments at the University of Technology and Education – The University of Danang (UTE-UD), comprising online video, voice recordings, and learning management system (LMS) logs. Experimental results demonstrate that the proposed multimodal model significantly outperforms unimodal and bimodal baselines, improving accuracy by up to 11% for student engagement prediction. The ablation study further confirms the complementary contributions of each modality, with visual and auditory signals playing a critical role in capturing real-time behavioural cues. Beyond performance gains, the findings highlight the limitations of traditional LMS-based analytics and emphasize the importance of multimodal integration for developing intelligent and adaptive learning systems. This study provides both theoretical and practical contributions by bridging advanced multimodal AI techniques with real-world deployment in higher education. It opens up a viable path towards data-driven and learner-centered education, especially in developing countries such as Vietnam.
Trung Hung Vo studied this question.
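
The attention-based fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the study's implementation: it assumes each modality encoder has already produced a fixed-length embedding, and it uses plain scaled dot-product attention with a single query vector standing in for a learned parameter. The function names (`softmax`, `attention_fuse`), the toy embeddings, and the query values are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Fuse per-modality embeddings with scaled dot-product attention.

    embeddings: dict mapping a modality name to a list of floats,
                all with the same length as `query`.
    query:      list of floats; a stand-in for a learned query vector
                (an assumption, not taken from the study).
    Returns (fused_vector, weights_by_modality).
    """
    names = list(embeddings)
    dim = len(query)
    # One relevance score per modality: dot(query, embedding) / sqrt(dim).
    scores = [
        sum(q * x for q, x in zip(query, embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # The fused representation is the attention-weighted sum of embeddings.
    fused = [
        sum(w * embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy embeddings standing in for encoder outputs (hypothetical values).
embeddings = {
    "visual": [0.9, 0.1, 0.0, 0.2],
    "audio": [0.2, 0.8, 0.1, 0.0],
    "text": [0.1, 0.1, 0.7, 0.3],
}
query = [1.0, 0.5, 0.0, 0.0]

fused, weights = attention_fuse(embeddings, query)
print(weights)
print(fused)
```

The attention weights sum to one, so the fused vector is a convex combination of the modality embeddings. In a trained model the query (and usually additional projection matrices) would be learned, and the fused vector would feed a classifier for engagement prediction.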