What question did this study set out to answer?

This research aims to enhance facial expression recognition (FER) in realistic classroom environments using a new multimodal framework.

June 7, 2026 | Open Access

A multimodal vision–language framework using contrastive language-image pre-training for robust facial expression analysis in realistic classroom environments

Key Points

Developed a Contrastive Language-Image Pre-Training (CLIP) based framework for FER.
Employed holistic prompt engineering to utilize limited labeled data effectively.
Validated the model on a dataset of 8,000 facial images labeled into six emotional categories.
The new model outperforms CNN and ViT baselines in recognizing facial expressions.
Effectively handles partial occlusions, changing illumination, and non-frontal viewpoints.
Successfully detects key emotional states important for adaptive tutoring.

Abstract

Facial expression recognition (FER) is a computational process that automatically classifies human emotional states from a single facial image or a sequence of facial images. In classrooms, FER has become a crucial technique for taking affective computing and learning analytics to an unprecedented level, because it enables instant tracking of students' feelings (e.g., engaged, bored, confused). Nevertheless, FER remains difficult to apply in real classroom environments: students tend to show subtle micro-expressions, diverse head poses, partial facial occlusions, and constantly changing illumination, all of which pose great challenges to traditional methods. Although Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) perform well on benchmark datasets, their performance deteriorates drastically in real classroom settings due to limited data and domain shift. This work introduces a new Contrastive Language-Image Pre-Training (CLIP) based framework that leverages the strong generalization of vision-language models for efficient facial expression recognition in classrooms. We develop a holistic prompt engineering method that combines classroom-specific semantic descriptions of expressions with efficient prompt tuning, which requires only limited labeled data. We propose to fine-tune only the last transformer block and the classification head. In this way, we preserve CLIP's original visual knowledge while still adapting to students' emotional patterns. Validation on a curated dataset of 8,000 facial images from undergraduate classroom recordings, labeled by domain experts into six expression categories (Engaged, Bored, Happy, Neutral, Sad, and Confused), demonstrates that our model outperforms CNN and ViT baselines. It is particularly strong in handling partial occlusions, changing illumination, and non-frontal viewpoints.
The framework effectively detects key emotional states that are important for learning analytics and adaptive tutoring. We also address ethical issues that are essential for deploying facial recognition in educational contexts and provide guidelines for responsible AI implementation. This work shows that prompt-based multimodal learning offers a scalable, data-efficient, and ethically aware solution for affect recognition in real classroom environments.
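
To make the prompt-based classification idea concrete, the following is a minimal sketch of CLIP-style scoring over the six expression categories named in the abstract. It is not the authors' implementation. The prompt template, the temperature value, and the toy embeddings are all assumptions made for illustration; in a real pipeline the vectors would come from CLIP's text and image encoders.

```python
import math

# The six expression categories named in the abstract.
CLASSES = ["Engaged", "Bored", "Happy", "Neutral", "Sad", "Confused"]


def build_prompts(classes, template="a photo of a student in a classroom who looks {}"):
    """Turn each label into a classroom-specific text prompt (hypothetical template)."""
    return [template.format(name.lower()) for name in classes]


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify(image_emb, text_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (temperature is an assumed value)."""
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


prompts = build_prompts(CLASSES)

# Toy stand-ins for encoder outputs (made-up numbers, not real CLIP features):
# one-hot "text embeddings" and an "image embedding" closest to the last class.
text_embs = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
image_emb = [0.1, 0.2, 0.0, 0.1, 0.1, 0.9]

probs = classify(image_emb, text_embs)
predicted = CLASSES[probs.index(max(probs))]
print(predicted)
```

Under this scheme, changing the wording of the prompts changes the text embeddings and therefore the decision boundaries, which is why prompt design can matter when labeled data is scarce.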
