What question did this study set out to answer?

This research aims to enhance the effectiveness and efficiency of facial expression recognition systems in real-world environments.

March 27, 2026 | Open Access

Improving facial expression recognition in realistic environments using deep learning approaches

Key Points

Conducted a comprehensive review of facial expression recognition challenges in driving contexts.
Introduced DFER-GCViT, a Vision Transformer-based model designed for driver facial expression recognition.
Proposed ShuffViT-DFER, a lightweight hybrid model integrating convolutional and transformer networks.
Explored multimodal approaches using Vision-Language Models, in particular by developing CLIVP-FER.
Applied parameter-efficient fine-tuning and temporal modeling of CLIP, through PE-CLIP, for dynamic facial expression recognition.
Demonstrated significant improvements in recognition accuracy and robustness under occlusion, pose variation, and lighting changes.
Showcased enhanced computational efficiency with lightweight models suitable for real-time applications.
Verified effectiveness across diverse benchmark datasets in challenging conditions.

Abstract

Facial expressions are vital channels of non-verbal communication, conveying rich information about emotional and cognitive states in social interactions. Enabling intelligent systems to recognize these expressions automatically is the goal of Facial Expression Recognition (FER), a key task in affective computing. Despite significant progress, existing FER methods still struggle to recognize facial expressions effectively and efficiently under real-world conditions, such as those found in driving environments or dynamic, in-the-wild scenarios. This thesis addresses these challenges by proposing several novel deep learning-based models that improve the performance, robustness, and efficiency of FER systems. The work begins with a comprehensive review of FER, highlighting persistent challenges in the field, particularly within the driving context. Subsequently, DFER-GCViT, a Vision Transformer-based architecture tailored for driver FER, is introduced to improve recognition accuracy under conditions of occlusion, pose variation, and lighting changes. To enhance computational efficiency, ShuffViT-DFER is proposed, a lightweight hybrid model that combines convolutional and transformer-based pretrained networks. Furthermore, multimodal approaches are explored using Vision-Language Models (VLMs), particularly CLIP. CLIVP-FER is developed to integrate visual and textual features, enhancing semantic understanding in the driving context. The research then shifts toward general dynamic FER, where the proposed PE-CLIP applies parameter-efficient fine-tuning and temporal modeling to CLIP, maintaining strong performance at reduced computational cost. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed models across diverse and challenging conditions. The results underscore the importance of optimizing architectures, incorporating multimodal cues, and enabling lightweight FER systems suitable for practical deployment.
