Interactive efficacy in college English classrooms improves language acquisition in modern educational environments. Traditional evaluation techniques often fail to capture student interest and achievement because they rely on subjective grading and fragmented analysis. This research proposes MM-IEANet, a multimodal deep learning framework that integrates visual, acoustic, and textual data on student activities to analyze and predict interactive efficacy. Our primary goal is to develop a robust system that accurately represents real-time student performance and instructor feedback through automated multimodal processing. MM-IEANet extracts meaningful representations using modality-specific encoders: CNNs for visual features, BiLSTMs for text, and 1D-CNNs with LSTMs for audio. A Cross-Modal Transformer Fusion module integrates these representations, and a Hierarchical Attention Network predicts efficacy by modality. On a custom-labeled dataset, MM-IEANet improved classification accuracy by more than 12% and considerably reduced score prediction error. The attention mechanisms indicated which modalities most affected grading. Analysis showed that acoustic attributes correlated most strongly with interactive success, followed by textual quality and visual presentation coherence. The approach also generalized well across student cohorts. In conclusion, MM-IEANet uses multimodal machine learning to evaluate English classroom engagement in a scalable, interpretable, and accurate manner.
Qiuyuan Tang (Sun,) studied this question.
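The abstract describes fusing per-modality representations with attention and then reading the attention weights to see which modality mattered most. The following is a minimal sketch of that general idea only, not the MM-IEANet implementation: it replaces the Cross-Modal Transformer Fusion module with a single scaled dot-product attention step over three modality embeddings. The function name `fuse_modalities`, the `query` vector (standing in for a learned parameter), and all numeric values are hypothetical and chosen purely for illustration; they are not results from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_modalities(embeddings, query):
    """Attention-weighted fusion of modality embeddings (illustrative only).

    embeddings: dict mapping a modality name to a fixed-size vector, as an
                encoder for that modality might produce.
    query:      vector of the same size; a stand-in for a learned parameter.
    Returns the fused vector and the per-modality attention weights.
    """
    names = list(embeddings)
    dim = len(query)
    # Scaled dot-product score for each modality.
    scores = [dot(embeddings[name], query) / math.sqrt(dim) for name in names]
    weights = softmax(scores)
    # Fused representation: weighted sum of the modality vectors.
    fused = [
        sum(w * embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy inputs: made-up 4-dimensional embeddings, not real encoder outputs.
embeddings = {
    "visual": [0.2, 0.1, 0.0, 0.3],
    "acoustic": [0.9, 0.4, 0.1, 0.7],
    "textual": [0.5, 0.2, 0.3, 0.1],
}
query = [1.0, 0.5, 0.0, 0.5]

fused, weights = fuse_modalities(embeddings, query)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

The weights sum to one, so they can be read as the relative contribution of each modality to the fused vector. This is the kind of quantity an attention-based model can expose for interpretation, although a full transformer fusion layer would compute such weights per head and per time step rather than once per sample.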