What question did this study set out to answer?

The research aims to develop a multimodal deep learning framework to improve educational resource matching accuracy.

March 5, 2026 · Open Access

Intelligent Matching Methods for Educational Resources Under a Multimodal Deep Learning Framework

Key Points

The research aims to develop a multimodal deep learning framework to improve educational resource matching accuracy.
Proposed the Attentive Extreme Gradient Sequential Bidirectional Memory Net (AEGSBMN) framework.
Integrated BERT, ResNet-50, and Wav2Vec for feature extraction.
Utilized Bi-LSTM and attention mechanisms to process multimodal data.
Classified with an Extreme Gradient Boosting (XGBoost) model for decision-making.
Evaluated on a multimodal educational dataset consisting of text, images, and audio.
Achieved 98.45% matching accuracy and improved Recall@1 (88%) and MRR (91%).
Outperformed existing models like IoT-PAMD and CM-LEQA.
Demonstrated significant enhancement in semantic alignment and learner comprehension.
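Recall@K and MRR, quoted in the key points above, are standard ranking metrics. The following is a minimal sketch of how they are commonly computed, assuming each query has exactly one relevant resource. The study does not state its exact definitions, and the function names and toy data here are hypothetical.

```python
from typing import List, Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], relevant: Sequence[str], k: int) -> float:
    """Fraction of queries whose relevant resource appears in the top-k results."""
    hits = sum(1 for results, target in zip(ranked, relevant) if target in results[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], relevant: Sequence[str]) -> float:
    """Average of 1/rank of the relevant resource; a miss contributes 0."""
    total = 0.0
    for results, target in zip(ranked, relevant):
        if target in results:
            total += 1.0 / (list(results).index(target) + 1)
    return total / len(relevant)


# Hypothetical toy data: four learner queries, each with a ranked list of resource IDs.
ranked: List[List[str]] = [
    ["r1", "r2", "r3"],  # relevant item at rank 1
    ["r5", "r4", "r6"],  # relevant item at rank 2
    ["r9", "r8", "r7"],  # relevant item at rank 3
    ["r2", "r1", "r3"],  # relevant item not retrieved
]
relevant: List[str] = ["r1", "r4", "r7", "r0"]

r1 = recall_at_k(ranked, relevant, 1)
r3 = recall_at_k(ranked, relevant, 3)
mrr = mean_reciprocal_rank(ranked, relevant)
print(r1, r3, round(mrr, 4))
```

Recall@1 only rewards a correct top result, while MRR also gives partial credit to lower-ranked hits, which is why the two figures are reported side by side.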

Abstract

• Multimodal attention–BiLSTM–XGBoost framework proposed
• Integrates BERT, ResNet-50 and Wav2Vec features
• Achieves 98.45% matching accuracy in education
• Improves Recall@1 (88%) and MRR (91%)
• Outperforms IoT-PAMD and CM-LEQA models

The rapid development of artificial intelligence (AI) and sensor technologies has enabled personalized and adaptive learning experiences. However, traditional educational resource-matching systems struggle to process multimodal data and capture complex semantic relationships, limiting their effectiveness in diverse learning environments. To address these challenges, this study proposes the Attentive Extreme Gradient Sequential Bidirectional Memory Net (AEGSBMN), a novel multimodal deep learning framework that integrates Bidirectional Long Short-Term Memory (Bi-LSTM) networks with an attention mechanism for contextual feature weighting, and an Extreme Gradient Boosting (XGBoost) classifier for final decision-making. The framework was evaluated using a multimodal educational dataset containing textual content, annotated images, and synchronized speech data. Preprocessing included spectral gating for audio denoising, histogram equalization for image enhancement, and tokenization with stop-word removal for text normalization. Feature extraction employed BERT embeddings for text, ResNet-50 for visual data, and Wav2Vec for acoustic signals. Extracted features were fused through Bi-LSTM layers with attention to capture temporal dependencies and highlight salient multimodal features, followed by XGBoost for classification. Experimental results demonstrate that AEGSBMN achieves a matching accuracy of 97%, with improved recall, ranking metrics (Recall@1: 88%, Recall@5: 94%, MRR: 91%), and reduced error rates (15.01%). These findings indicate that AEGSBMN effectively enhances semantic alignment, learner comprehension, and adaptive resource retrieval in multimodal educational environments.
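Two of the preprocessing steps named in the abstract, tokenization with stop-word removal and histogram equalization, can be illustrated with a dependency-free sketch. This is not the authors' code: the stop-word list, the regular expression, and the function names are assumptions made for illustration, and spectral gating for audio is omitted.

```python
import re
from typing import List

# Tiny illustrative stop-word list (an assumption; the study does not list its own).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}


def normalize_text(text: str) -> List[str]:
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def equalize_histogram(pixels: List[List[int]], levels: int = 256) -> List[List[int]]:
    """Spread grayscale intensities (0..levels-1) using the cumulative histogram."""
    flat = [p for row in pixels for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1

    # Cumulative distribution of pixel counts.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return [row[:] for row in pixels]

    def remap(p: int) -> int:
        return round((cdf[p] - cdf_min) / (total - cdf_min) * (levels - 1))

    return [[remap(p) for p in row] for row in pixels]


tokens = normalize_text("The structure of a Cell in Biology")
image = equalize_histogram([[50, 50], [100, 150]])
print(tokens)
print(image)
```

After equalization the darkest input value maps to 0 and the brightest to 255, which stretches a low-contrast image across the full intensity range.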
The framework was implemented in Python, using PyTorch and Hugging Face Transformers for deep learning, and XGBoost for classification.
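The attention step that weights Bi-LSTM outputs before classification can be sketched without any framework. The version below assumes simple dot-product scoring against a learned query vector. The study does not specify its attention formulation, so `attention_pool`, `query`, and the toy vectors are hypothetical stand-ins rather than the published architecture.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states: List[List[float]], query: List[float]) -> Tuple[List[float], List[float]]:
    """Score each time step against the query, then return the weighted sum of states."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return context, weights


# Toy stand-ins for three Bi-LSTM output vectors (one per time step).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 2.0]  # hypothetical learned attention vector

context, weights = attention_pool(states, query)
print(weights)
print(context)
```

In a pipeline like the one described, a pooled vector of this kind would be passed to the XGBoost classifier as its input features. In PyTorch the same computation would typically be expressed with tensor operations inside a module.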
