June 11, 2026 | Open Access

Enhancing Learning Experience using Multimodal AI: Integrating Vision, Speech and Text in E-learning Systems

Key Points

The study aims to enhance learning analytics by integrating visual, auditory, and textual data to better analyze student behavior.
Proposed a multimodal AI framework combining deep learning encoders and attention-based fusion mechanisms.
Collected data from hybrid learning environments, including online videos, voice recordings, and LMS logs, at the University of Technology and Education – The University of Danang (UTE-UD).
Conducted experimental evaluations comparing the multimodal model to unimodal and bimodal baselines.
The multimodal model achieved up to an 11% improvement in accuracy for student engagement prediction over unimodal and bimodal baselines.
Ablation study revealed the critical role of visual and auditory signals in capturing real-time behavioral cues.
Emphasized the limitations of traditional LMS analytics and the benefits of multimodal integration.

Abstract

The growing adoption of online and blended learning has generated a large volume of educational data. However, most existing learning analytics systems are limited to single-modal data sources, primarily text logs, and therefore fail to capture the full range of student interactions. This study proposes a multimodal artificial intelligence framework that integrates visual, auditory, and textual data to enhance the analysis of learning behaviour in real-world educational settings. The proposed approach employs modality-specific deep learning encoders combined with an attention-based fusion mechanism to model complex interactions across heterogeneous data sources. A comprehensive dataset was collected from hybrid learning environments at the University of Technology and Education – The University of Danang (UTE-UD), comprising online video, voice recordings, and learning management system (LMS) logs. Experimental results demonstrate that the proposed multimodal model significantly outperforms unimodal and bimodal baselines, achieving an improvement of up to 11% in accuracy for student engagement prediction. The ablation study further confirms the complementary contributions of each modality, with visual and auditory signals playing a critical role in capturing real-time behavioural cues. Beyond performance gains, the findings highlight the limitations of traditional LMS-based analytics and emphasize the importance of multimodal integration for developing intelligent and adaptive learning systems. This study provides both theoretical and practical contributions by bridging advanced multimodal AI techniques with real-world deployment in higher education. It opens up a viable path towards data-driven and learner-centered education, especially in the context of developing countries like Vietnam.
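The attention-based fusion step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the encoders, the embedding size, or the form of attention used. The sketch assumes each modality encoder has already produced a fixed-length embedding, and it uses a hypothetical query vector with scaled dot-product scoring to weight the three modalities before summing them. The function names `softmax` and `attention_fuse` and all numeric values are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Fuse per-modality embeddings into one vector.

    embeddings: dict mapping modality name -> list of floats
                (assumed to come from modality-specific encoders).
    query:      list of floats of the same length; a hypothetical
                learned vector, fixed here for illustration.
    Returns (fused_vector, attention_weights_by_modality).
    """
    names = list(embeddings)
    dim = len(query)
    # Scaled dot-product score between the query and each modality.
    scores = [
        sum(q * x for q, x in zip(query, embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum of the modality embeddings, one dimension at a time.
    fused = [
        sum(w * embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy embeddings standing in for encoder outputs (made-up values).
embeddings = {
    "vision": [0.9, 0.1, 0.0, 0.2],
    "speech": [0.2, 0.8, 0.1, 0.0],
    "text": [0.1, 0.1, 0.9, 0.3],
}
query = [1.0, 0.5, 0.0, 0.0]

fused, weights = attention_fuse(embeddings, query)
print(weights)
print(fused)
```

In a trained model the query (or a small scoring network) would be learned jointly with the encoders, and the fused vector would feed a classifier for engagement prediction. Inspecting the attention weights is one simple way to see which modality the model relies on for a given sample.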
