Recent advances in artificial intelligence (AI) and deep learning (DL) have transformed numerous fields, including language acquisition, where multimodal learning has emerged as a key approach to improving educational outcomes. Integrating visual and auditory modalities in English vocabulary learning enhances comprehension, engagement, and retention, whereas traditional single-modal methods remain limited in scope and effectiveness. This study proposes a novel multimodal learning framework based on Mel-Spectrogram Convolutional Neural Networks (MSCNNs) that enhances vocabulary acquisition by jointly processing images and audio. The framework first processes each modality through a distinct MSCNN branch, then fuses the extracted features into a unified multimodal representation. Trained on a comprehensive dataset of word-image-audio pairs, the model captures the complementary strengths of both modalities, resulting in faster and more robust learning. The experimental results show that the proposed model achieves an accuracy of 97.2%, a sensitivity of 96.6%, a specificity of 98.1%, an F1-score of 96.9%, a recall of 96.5%, and a precision of 96.7%. The model also attains a Matthews Correlation Coefficient (MCC) of 0.954, an Area Under the Curve (AUC) of 0.978, a Mean Absolute Error (MAE) of 0.03, and a Root Mean Squared Error (RMSE) of 0.11, indicating robust and reliable predictions. The proposed MSCNN-based approach offers a promising direction for vocabulary learning by integrating multiple sensory modalities to create a richer, more effective, and more engaging educational experience.
Juan Zhao studied this question.