This article proposes a cross-modal adversarial learning framework, based on multi-level feature extraction and an integrated Transformer-CNN-LSTM model, for analyzing the emotional dynamics of non-native English learners’ classroom participation. Using multimodal inputs such as voice, facial expressions, and behavioral data, the framework predicts students’ classroom participation and emotional states, and it improves prediction accuracy by integrating these data. In the experiments, the multimodal data were first binarized and facial expression features were extracted. The emotion data and expression data were then weighted and fused to construct the training dataset. In the emotion-modality experiment, the model predicted “happy” and “surprised” with high accuracy (99% and 97%, respectively), while accuracy for “angry” was lower, at about 94%. In the expression-modality experiment, the model performed stably across all emotion categories and reached 99% accuracy for “happy”. With the fused data, prediction accuracy ranged from 92% to 97%, indicating that fusion improves the stability of the model’s emotion prediction. The prediction of “angry” still leaves room for improvement, however, with relatively large fluctuations in accuracy and error. A data density analysis shows that the fused data are slightly less dense than the single emotion or expression data, but their error range remains reasonable, indicating that the fusion strategy has certain advantages for multimodal emotion prediction. Overall, the proposed integrated model performs well on multimodal emotion data, particularly in recognition accuracy and model stability, and has high practical value.
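The weighted fusion step mentioned above can be illustrated with a minimal sketch. The abstract does not state the fusion formula or the weights, so everything here is an assumption made for illustration: the function name `weighted_fusion`, the parameter `w_emotion`, and the choice of a convex combination of two equal-length feature vectors.

```python
def weighted_fusion(emotion_feats, expression_feats, w_emotion=0.5):
    """Fuse two equal-length feature vectors with a convex weighted sum.

    Assumption: both modalities were already projected to the same
    dimensionality, and the expression weight is 1 - w_emotion.
    """
    if len(emotion_feats) != len(expression_feats):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= w_emotion <= 1.0:
        raise ValueError("w_emotion must lie in [0, 1]")
    w_expression = 1.0 - w_emotion
    # Element-wise weighted sum of the two modality vectors.
    return [
        w_emotion * e + w_expression * x
        for e, x in zip(emotion_feats, expression_feats)
    ]


def build_fused_dataset(emotion_rows, expression_rows, labels, w_emotion=0.5):
    """Pair each fused feature vector with its label (hypothetical helper)."""
    if not (len(emotion_rows) == len(expression_rows) == len(labels)):
        raise ValueError("inputs must contain the same number of samples")
    return [
        (weighted_fusion(e, x, w_emotion), y)
        for e, x, y in zip(emotion_rows, expression_rows, labels)
    ]
```

In a sketch like this, the weight would typically be treated as a hyperparameter and tuned on validation data; the study itself may use a different, possibly learned, weighting scheme.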
Yaru Zheng studied this question.