What question did this study set out to answer?

April 3, 2026 | Open Access

A multimodal learning framework using mel-spectrogram convolutional neural networks for English vocabulary acquisition

Key Points

The study aims to develop a multimodal learning framework using mel-spectrogram convolutional neural networks (MSCNNs) to enhance English vocabulary acquisition.
Processed auditory and visual data through separate MSCNN branches.
Fused features from both modalities for a unified representation.
Trained on a dataset comprising word-image-audio pairs.
Achieved an accuracy of 97.2% in the vocabulary acquisition task.
Recorded a sensitivity of 96.6% and specificity of 98.1%.
Obtained an F1-score of 96.9% with a precision of 96.7%.
Reported a Matthews Correlation Coefficient (MCC) of 0.954 and an Area Under the Curve (AUC) of 0.978.
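
The classification metrics listed above all derive from confusion-matrix counts. The sketch below shows the standard binary definitions, purely as an illustration: the function name `binary_metrics` and the example counts are hypothetical, and the summary does not state whether the paper computed its figures per class, averaged across classes, or in some other way.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate, also called recall
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews Correlation Coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Made-up counts, chosen only to exercise the formulas.
example = binary_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, so it stays informative when the classes are imbalanced, which is one reason it is often reported alongside accuracy.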

Abstract

Recent advancements in artificial intelligence (AI) and deep learning (DL) have transformed numerous fields, particularly language acquisition, where multimodal learning has emerged as a key approach to improving educational outcomes. The integration of visual and auditory modalities in English vocabulary learning enhances comprehension, engagement, and retention. However, traditional single-modal methods remain limited in scope and effectiveness. This study proposes a novel multimodal learning framework based on Mel-Spectrogram Convolutional Neural Networks (MSCNNs) that enhances vocabulary acquisition by jointly processing images and audio. The framework first processes each modality through a distinct MSCNN branch, then fuses the extracted features into a unified multimodal representation. Trained on a comprehensive dataset of word-image-audio pairs, the model captures the complementary strengths of both modalities, resulting in faster and more robust learning. The experimental results indicate that the proposed model performs strongly across a range of metrics, with an accuracy of 97.2%, a sensitivity of 96.6%, a specificity of 98.1%, an F1-score of 96.9%, a recall of 96.5%, and a precision of 96.7%. Additionally, the model attains a Matthews Correlation Coefficient (MCC) of 0.954, an Area Under the Curve (AUC) of 0.978, and low error values, with a Mean Absolute Error (MAE) of 0.03 and a Root Mean Squared Error (RMSE) of 0.11, indicating robust and reliable predictions. The proposed MSCNN-based approach offers a promising direction for enhancing vocabulary learning by integrating multiple sensory modalities to create a richer, more effective, and more engaging educational experience.
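
The two-branch design described in the abstract can be pictured with a small sketch. It is a toy illustration under stated assumptions, not the authors' architecture: each "branch" is a stand-in that averages the rows of its input rather than a trained CNN, fusion is assumed to be simple concatenation (the abstract does not specify the fusion method), and all names, inputs, and weights are hypothetical.

```python
import math


def branch_features(grid):
    """Stand-in for one CNN branch: reduce each row of a 2-D input to its mean.

    A real branch would apply learned convolutions; this only mimics the
    shape of the computation (2-D input in, feature vector out).
    """
    return [sum(row) / len(row) for row in grid]


def fuse(audio_features, image_features):
    """Feature-level fusion by concatenation (an assumption, see lead-in)."""
    return audio_features + image_features


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Linear layer plus softmax over the fused representation."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Made-up inputs: a 2-band x 2-frame "mel-spectrogram" and a 3 x 2 "image".
mel_spectrogram = [[0.2, 0.4], [0.6, 0.8]]
image = [[0.1, 0.3], [0.5, 0.7], [0.9, 1.0]]

fused = fuse(branch_features(mel_spectrogram), branch_features(image))

# Hypothetical, untrained weights for a 2-class output over 5 fused features.
weights = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0]]
biases = [0.0, 0.0]
probabilities = classify(fused, weights, biases)
print(len(fused), probabilities)
```

Concatenation keeps each modality's features intact and leaves it to the classifier to weigh them, which is why it is a common baseline for feature-level fusion.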
