April 12, 2026 | Open Access

TSM-Transformer: a hybrid EfficientNetV2-Lite and temporal shift model for facial expression recognition in videos

Key Points

The research aims to develop a hybrid framework for dynamic facial expression recognition in video sequences that needs no Face Detection and Alignment (FDA) preprocessing.
Developed the TSM-Transformer framework, which combines EfficientNetV2-Lite with Temporal Shift Modules.
Utilized a Transformer for global sequence modeling without Face Detection and Alignment preprocessing.
Evaluated performance on a multi-class emotion dataset, focusing on accuracy, F1-score, and AUC values.
Achieved 91.38% accuracy and 0.895 F1-score on the emotion dataset.
Recorded high AUC values ranging from 0.92 to 0.97 across seven emotion categories.
Reported significant gains over strong baselines in challenging emotion categories such as Surprise (+5%) and Fear (+8%).
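
As a rough illustration of the accuracy and F1-score metrics named in the key points, the sketch below computes both for a multi-class label set. The summary does not say whether the reported F1-score is macro- or weighted-averaged, so macro averaging is an assumption here, and the function name and toy labels are hypothetical.

```python
def accuracy_and_macro_f1(y_true, y_pred, labels):
    """Return (accuracy, macro-averaged F1) for multi-class predictions.

    Macro averaging is an assumption; the source does not state which
    averaging scheme its F1-score uses.
    """
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return accuracy, sum(f1_scores) / len(labels)


# Toy example with three of the emotion categories (illustrative data only).
y_true = ["fear", "fear", "surprise", "happy"]
y_pred = ["fear", "surprise", "surprise", "happy"]
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred, ["fear", "surprise", "happy"])
```

Macro averaging weights every emotion class equally, so it is sensitive to weak performance on rare or difficult classes such as Fear.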

Abstract

Many important applications rely on Dynamic Facial Expression Recognition (DFER), including affective computing, mental health monitoring, and human–computer interaction. Current state-of-the-art approaches depend on Face Detection and Alignment (FDA) preprocessing, which increases computational cost and discards important contextual cues. This paper proposes TSM-Transformer, a novel FDA-free hybrid DFER framework that integrates EfficientNetV2-Lite for lightweight spatial feature extraction, Temporal Shift Modules (TSM) for parameter-free local motion encoding, and a Transformer-based temporal fusion mechanism for global sequence modeling. By processing full-frame video inputs, the proposed model preserves both facial and body cues, enhancing robustness under real-world variations in lighting, occlusion, head pose, and background complexity. Experimental evaluation on a multi-class emotion dataset demonstrates that TSM-Transformer achieves state-of-the-art performance, with 91.38% accuracy, a 0.895 F1-score, and consistently high AUC values (0.92–0.97) across seven emotion categories. Notably, the model records significant gains over strong baselines in challenging classes such as Surprise (+5%) and Fear (+8%), while maintaining real-time inference capability without computationally expensive preprocessing. Ablation studies confirm the complementary strengths of the TSM and Transformer modules in capturing both micro- and macro-expression dynamics. The proposed approach offers a scalable, deployment-ready solution for DFER in unconstrained environments, with potential for extension to multimodal emotion recognition.
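
To make "parameter-free local motion encoding" concrete, here is a minimal sketch of a temporal shift over per-frame feature vectors. It follows the general temporal-shift idea (one slice of channels is taken from the next frame, another slice from the previous frame, and the rest stay in place) and is not the paper's implementation. The function name, the `fold_div` parameter, the zero padding at clip boundaries, and the toy input are all assumptions for illustration.

```python
def temporal_shift(frames, fold_div=4):
    """Shift a fraction of channels along the time axis, with no learned weights.

    frames: list of T feature vectors, each a list of C numbers.
    fold_div: hypothetical parameter; C // fold_div channels are shifted in
              each direction, and the remaining channels are left untouched.
    """
    num_frames = len(frames)
    channels = len(frames[0])
    fold = channels // fold_div

    shifted = []
    for t in range(num_frames):
        row = list(frames[t])  # channels from 2 * fold onward stay unchanged
        for c in range(fold):
            # First slice: frame t sees the next frame (zero past the clip end).
            row[c] = frames[t + 1][c] if t + 1 < num_frames else 0.0
        for c in range(fold, 2 * fold):
            # Second slice: frame t sees the previous frame (zero before the start).
            row[c] = frames[t - 1][c] if t >= 1 else 0.0
        shifted.append(row)
    return shifted


# Toy clip: 3 frames, 4 channels each (illustrative values only).
clip = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
mixed = temporal_shift(clip, fold_div=4)
```

Because the shift only moves existing values between neighboring frames, it adds no trainable parameters; any later per-frame layer then mixes information from adjacent time steps.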
