What question did this study set out to answer?

The aim is to develop a framework that recognizes emotions and behavioral states of online learners using webcam video analysis.

June 7, 2026 | Open Access

Intelligent emotion and behavior recognition for online learners using deep affective computing

Key Points

Developed an affective-computing framework that uses a convolutional network with an attention module for facial emotion detection.
Used a bidirectional GRU over 16-frame sequences of webcam video to predict four behavioral states: engaged, confused, frustrated and disengaged.
Evaluated emotion recognition on the FER2013 benchmark and behavior recognition on the EmoDetect dataset (120 learners, 800 annotated temporal segments).
Achieved 94.5% accuracy on the FER2013 Private Test split, and 86.9% accuracy with a macro-F1 of 85.4% on the four-class behavior task.
Improved performance over facial-expression-only baselines.
Ran at approximately 25 frames per second on a mid-range GPU, indicating feasibility for real-time use.
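
The behavior results are reported as both accuracy and macro-F1, and the two can diverge when the four states are unevenly represented. The sketch below is a minimal illustration of how the two metrics are conventionally computed. The function names are hypothetical and this is not the authors' evaluation code.

```python
from typing import List, Sequence

# The four behavioral states named in the paper.
STATES = ["engaged", "confused", "frustrated", "disengaged"]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of segments whose predicted state matches the annotation."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one state: 2*TP / (2*TP + FP + FN), defined as 0 if the denominator is 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: List[str] = STATES) -> float:
    """Unweighted mean of the per-class F1 scores, so rare states count equally."""
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)
```

Because macro-F1 averages over classes rather than segments, a model that rarely predicts a minority state such as "disengaged" is penalized more heavily than accuracy alone would show.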

Abstract

Teachers in online classes often have limited visibility into how students are feeling and whether they are staying engaged. This paper presents a practical affective-computing framework that analyses webcam video to recognize both facial emotions and learning-centered behavioral states during live virtual sessions. Student faces are first detected in each frame, then encoded using a convolutional network with an attention module that emphasizes informative facial regions. To capture how affect evolves over time, the extracted features are grouped into fixed 16-frame sequences and modelled with a bidirectional GRU to predict four behavioral states: engaged, confused, frustrated and disengaged. We evaluate emotion recognition on the FER2013 benchmark under the standard Train/Public Test/Private Test split, and we evaluate behavior recognition on a publicly available online-learning facial expression dataset (EmoDetect, Kaggle) covering 120 learners (800 annotated temporal segments; inter-annotator agreement κ≈0.75). The proposed model reaches 94.5% accuracy on FER2013 Private Test and achieves 86.9% accuracy with a macro-F1 of 85.4% on the four-class behavior task, improving over facial-expression-only baselines. Profiling further shows that the full pipeline can run at around 25 frames per second on a mid-range GPU, indicating feasibility for real-time monitoring. These results suggest that combining attention-based facial representations with temporal modelling can provide instructors with clearer, webcam-observable indicators of learners' affect and engagement during online teaching.
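
The step of grouping per-frame features into fixed 16-frame sequences can be sketched as follows. This is a minimal illustration under stated assumptions. The abstract does not say whether windows overlap or how leftover frames are handled, so non-overlapping windows and dropping the incomplete tail are assumptions here, and the function names are hypothetical. The duration helper also assumes frames are consumed at the reported throughput of about 25 frames per second.

```python
from typing import List, Sequence

WINDOW = 16  # frames per sequence, as stated in the abstract


def group_into_windows(frame_features: Sequence[Sequence[float]],
                       window: int = WINDOW) -> List[List[Sequence[float]]]:
    """Split per-frame feature vectors into consecutive windows of `window` frames.

    Assumption: windows do not overlap, and trailing frames that do not
    fill a complete window are dropped.
    """
    n_full = len(frame_features) // window
    return [list(frame_features[i * window:(i + 1) * window])
            for i in range(n_full)]


def window_duration_seconds(window: int = WINDOW, fps: float = 25.0) -> float:
    """Time span covered by one window, assuming frames arrive at `fps`."""
    return window / fps
```

Under these assumptions, each window spans 16 / 25 = 0.64 seconds of video, and each window would be the unit passed to the temporal model that predicts one of the four behavioral states.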
