Student engagement is a critical factor in the teaching effectiveness of university physical education courses. To address common problems in elective physical education courses, such as low attendance and insufficient classroom interaction, this study proposes an automated student engagement prediction model based on a multimodal Transformer. The study uses the University Student Sports and Physical Health Dataset (https://www.ncmi.cn/phda/dataDetails.do?id=CSTR:17970.11.A0032.202412.278.V1.0) as its data source. After preprocessing, the multimodal data are filtered and divided into a training set (80%) and a testing set (20%). Features are then extracted for each modality: a One-Dimensional Convolutional Neural Network (1D CNN) combined with Long Short-Term Memory (LSTM) processes sensor data, Bidirectional Encoder Representations from Transformers (BERT) extracts text features, and a Vision Transformer (ViT) encodes video segments. Next, a hierarchical cross-modal Transformer architecture is designed. It strengthens single-modal feature representations through intra-modal self-attention and uses a cross-modal attention mechanism to dynamically align heterogeneous data (e.g., the correlation between heart rate changes and “fatigue” text descriptions), thereby achieving multimodal interaction. Finally, the cross-modal features are fused and a fully connected layer outputs the student engagement prediction. On this dataset, the proposed model reduces the mean absolute error in the engagement regression task by 22.3% relative to the single-modal baseline (1D CNN+LSTM), and the F1-score for student engagement prediction rises to 0.81. Ablation experiments confirm the necessity of multimodal fusion: the proposed model achieves over 90% accuracy in student engagement prediction, whereas prediction performance decreases by 17%-35% when only a single modality is used.
Furthermore, in terms of operational efficiency, the model completes engagement prediction for a single class session (a 10-minute data window) within 0.2 s, a 40% improvement in evaluation efficiency over baseline algorithms, which meets real-time classroom monitoring requirements. This study therefore improves both the accuracy and the real-time capability of student engagement prediction. Its interpretable cross-modal correlation analysis provides an intelligent decision-making basis for optimizing physical education teaching and offers a reference for moving educational assessment from experience-driven to data-driven approaches.
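As an illustration of the cross-modal attention step described above, the following is a minimal sketch in plain Python, not the implementation used in this study. It assumes standard scaled dot-product attention in which queries come from one modality (here, sensor features) and keys and values come from another (here, text features). The function names `softmax` and `cross_modal_attention`, the two-dimensional toy features, and the absence of learned projection matrices are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention across two modalities.

    queries: feature vectors from modality A (e.g. sensor time steps)
    keys, values: feature vectors from modality B (e.g. text tokens)
    Returns the attended features for A and the attention weights.
    Hypothetical helper: learned Q/K/V projections are omitted.
    """
    dim = len(keys[0])
    attended, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors from the other modality.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        attended.append(out)
        weights.append(w)
    return attended, weights


# Toy inputs (assumed): 3 sensor time steps attend over 2 text tokens.
sensor_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[2.0, 0.0], [0.0, 2.0]]
attended, weights = cross_modal_attention(sensor_feats, text_feats, text_feats)
```

Each row of `weights` shows how strongly one sensor time step attends to each text token. Inspecting such rows is one way the kind of cross-modal correlation mentioned above (for example, heart rate changes against "fatigue" descriptions) could be examined.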
Jianping Li studied this question.