What question did this study set out to answer?

The study aims to develop a framework that uses multimodal data to predict the emotional states and classroom participation levels of non-native English learners.

April 28, 2026

Open Access

A study of personalized behavior prediction for non-native English learners under the cross-modal adversarial learning framework

Key Points

Aimed to develop a framework that predicts the emotional states and classroom participation levels of non-native English learners from multimodal data.
Developed an integrated model combining Transformer, CNN, and LSTM for feature extraction.
Binarized multimodal data, extracted facial expression features, and fused emotion and expression data to create a training dataset.
Conducted prediction experiments on emotional states and expression patterns, measuring accuracy across various emotions.
In emotion mode prediction, achieved 99% accuracy for 'happy' and 97% for 'surprised', but only about 94% for 'angry'.
In expression pattern prediction, the model performed stably across all emotion categories, reaching 99% accuracy for 'happy'.
With fused multimodal data, prediction accuracy ranged from 92% to 97%, indicating that fusion improves the stability of the model's emotion predictions.
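
The per-emotion accuracies quoted above can be read as one accuracy figure per emotion label. The study summary does not say how these figures were computed, so the following is only a minimal sketch under one common assumption: for each true label, accuracy is the fraction of samples with that label that the model predicted correctly. The function name `per_class_accuracy` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Return, for each true label, the share of its samples predicted correctly.

    Assumption: this is one plausible reading of "accuracy for an emotion";
    the paper may define the metric differently.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = Counter(y_true)  # number of samples per true label
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # hits per label
    return {label: correct[label] / total[label] for label in total}


# Toy data for illustration only, not results from the study.
toy_true = ["happy", "happy", "angry", "angry"]
toy_pred = ["happy", "happy", "angry", "happy"]
toy_scores = per_class_accuracy(toy_true, toy_pred)
```

Under this reading, a lower score for one label, as reported for 'angry', means that a larger share of that label's samples were assigned to some other emotion.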

Abstract

This article proposes a cross-modal adversarial learning framework, based on multi-level feature extraction and an integrated Transformer-CNN-LSTM model, for analyzing the emotional dynamics of non-native English learners’ classroom participation. Using multimodal inputs such as voice, facial expressions, and behavioral data, the framework predicts students’ classroom participation and emotional states, and it improves prediction accuracy by integrating these data. In the experiment, the multimodal data were first binarized and facial expression features were extracted. The emotion data and expression data were then weighted and fused to construct the training dataset. In the emotion mode prediction experiment, the model predicted “happy” and “surprised” emotions with high accuracy (99% and 97%, respectively) but predicted “angry” emotions with lower accuracy, at about 94%. In the prediction of expression patterns, the model performed stably across all emotion categories, reaching an accuracy of 99% for “happy” emotions. In the fusion data mode, prediction accuracy ranged from 92% to 97%, indicating that the fused data improve the stability of the model’s emotion prediction. However, there is still room for improvement in the model’s prediction of “angry” emotions, for which accuracy and error fluctuate relatively widely. Data density analysis shows that the density of the fused data is slightly lower than that of the single emotion or expression data, but its error range is relatively reasonable, indicating that the fusion strategy has certain advantages in multimodal emotion prediction. Overall, the integrated model proposed in this article demonstrates good performance in processing multimodal emotion data, especially in emotion recognition accuracy and model stability, and has high practical value.
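
The abstract describes two preprocessing steps, binarizing the multimodal data and fusing emotion and expression data with weights, without giving their details. The sketch below illustrates one simple way such steps could look, assuming a fixed threshold for binarization and a convex combination of two equally sized feature vectors for fusion. The function names, the threshold of 0.5, and the weight `w_emotion` are all hypothetical; the paper's actual procedure may differ.

```python
from typing import List, Sequence


def binarize(values: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map each value to 1 if it is at or above the threshold, else 0.

    Assumption: a single fixed threshold; the paper does not state its rule.
    """
    return [1 if v >= threshold else 0 for v in values]


def weighted_fuse(emotion: Sequence[float], expression: Sequence[float],
                  w_emotion: float = 0.5) -> List[float]:
    """Fuse two feature vectors element-wise as a convex combination.

    Assumption: the weights sum to 1 and apply uniformly to every feature.
    """
    if len(emotion) != len(expression):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= w_emotion <= 1.0:
        raise ValueError("w_emotion must lie in [0, 1]")
    w_expression = 1.0 - w_emotion
    return [w_emotion * e + w_expression * x for e, x in zip(emotion, expression)]


# Toy vectors for illustration only, not data from the study.
mask = binarize([0.2, 0.5, 0.9])
fused = weighted_fuse([1.0, 0.0], [0.0, 1.0], w_emotion=0.25)
```

Keeping the two weights tied to a single parameter makes the fused vector stay on the same scale as its inputs, which is one reason a convex combination is a common default for this kind of late fusion.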
