Teachers in online classes often have limited visibility into how students are feeling and whether they remain engaged. This paper presents a practical affective-computing framework that analyses webcam video to recognize both facial emotions and learning-centered behavioral states during live virtual sessions. Student faces are first detected in each frame and then encoded by a convolutional network with an attention module that emphasizes informative facial regions. To capture how affect evolves over time, the extracted features are grouped into fixed 16-frame sequences and modelled with a bidirectional GRU that predicts four behavioral states: engaged, confused, frustrated and disengaged. We evaluate emotion recognition on the FER2013 benchmark under the standard Train/Public Test/Private Test split, and behavior recognition on a publicly available online-learning facial expression dataset (EmoDetect, Kaggle) comprising 800 annotated temporal segments from 120 learners (inter-annotator agreement κ≈0.75). The proposed model reaches 94.5% accuracy on the FER2013 Private Test set and 86.9% accuracy with a macro-F1 of 85.4% on the four-class behavior task, improving over facial-expression-only baselines. Profiling further shows that the full pipeline runs at around 25 frames per second on a mid-range GPU, indicating that real-time monitoring is feasible. These results suggest that combining attention-based facial representations with temporal modelling can give instructors clearer, webcam-observable indicators of learners' affect and engagement during online teaching.
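As an illustration only, the following minimal sketch shows two steps mentioned above: grouping per-frame features into fixed 16-frame sequences, and scoring four-class predictions with macro-F1. It is not the authors' implementation. The function names `group_into_sequences` and `macro_f1`, the non-overlapping stride, and the choice to drop a trailing incomplete window are assumptions made for the example; the abstract specifies only the sequence length and the four state labels.

```python
from typing import List, Sequence

# The four behavioral states named in the abstract.
STATES = ("engaged", "confused", "frustrated", "disengaged")
SEQ_LEN = 16  # fixed sequence length stated in the abstract


def group_into_sequences(frame_features: Sequence,
                         seq_len: int = SEQ_LEN,
                         stride: int = SEQ_LEN) -> List[list]:
    """Split per-frame features into fixed-length windows.

    Assumption: windows do not overlap by default (stride == seq_len)
    and a trailing window shorter than seq_len is dropped.
    """
    if seq_len <= 0 or stride <= 0:
        raise ValueError("seq_len and stride must be positive")
    last_start = len(frame_features) - seq_len
    return [list(frame_features[s:s + seq_len])
            for s in range(0, last_start + 1, stride)]


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str] = STATES) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy stand-in for 40 per-frame feature vectors.
    windows = group_into_sequences(list(range(40)))
    print(len(windows), [len(w) for w in windows])

    y_true = ["engaged", "engaged", "confused",
              "frustrated", "disengaged", "disengaged"]
    y_pred = ["engaged", "confused", "confused",
              "frustrated", "disengaged", "engaged"]
    print(round(macro_f1(y_true, y_pred), 4))
```

Macro-F1 averages the per-class scores without weighting by class frequency, so a rarely observed state counts as much as a common one. This is one reason it can differ from plain accuracy on the same predictions.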
Priya et al. (Fri,) studied this question.