Many important applications rely on Dynamic Facial Expression Recognition (DFER), including affective computing, mental health monitoring, and human–computer interaction. Current state-of-the-art approaches depend on Face Detection and Alignment (FDA) preprocessing, which increases computational cost and discards important contextual cues. This paper proposes a novel FDA-free hybrid DFER framework, TSM-Transformer, that integrates EfficientNetV2-Lite for lightweight spatial feature extraction, Temporal Shift Modules (TSM) for parameter-free local motion encoding, and a Transformer-based temporal fusion mechanism for global sequence modeling. By processing full-frame video inputs, the proposed model preserves both facial and body cues, enhancing robustness under real-world conditions with variations in lighting, occlusion, head pose, and background complexity. Experimental evaluation on a multi-class emotion dataset demonstrates that the TSM-Transformer achieves state-of-the-art performance, with 91.38% accuracy, an F1-score of 0.895, and consistently high AUC values (0.92–0.97) across seven emotion categories. Notably, the model records significant gains over strong baselines in challenging classes such as Surprise (+5%) and Fear (+8%), while maintaining real-time inference capability without computationally expensive preprocessing. Ablation studies confirm the complementary strengths of the TSM and Transformer modules in capturing both micro- and macro-expression dynamics. The proposed approach offers a scalable, deployment-ready solution for DFER in unconstrained environments, with potential for extension to multimodal emotion recognition.
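To illustrate why the temporal shift is described as parameter-free, the following is a minimal sketch in plain Python of the channel-shift idea, assuming the zero-padded bidirectional shift commonly used for TSM. The function name `temporal_shift`, the `fold_div` parameter, and the toy feature values are illustrative assumptions and are not taken from this paper's implementation.

```python
from typing import List


def temporal_shift(features: List[List[float]], fold_div: int = 4) -> List[List[float]]:
    """Shift a fraction of channels along the time axis (illustrative sketch).

    features: T frames, each a list of C per-frame channel values.
    fold_div: assumed hyperparameter; C // fold_div channels are shifted
              in each temporal direction, the rest stay in place.
    """
    num_frames = len(features)
    num_channels = len(features[0])
    fold = num_channels // fold_div

    # Start from zeros so positions with no temporal neighbour are zero-padded.
    shifted = [[0.0] * num_channels for _ in range(num_frames)]

    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first group: read from the next frame
            elif c < 2 * fold:
                src = t - 1  # second group: read from the previous frame
            else:
                src = t      # remaining channels: unchanged
            if 0 <= src < num_frames:
                shifted[t][c] = features[src][c]
    return shifted


# Toy input: 3 frames, 4 channels each.
frames = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
mixed = temporal_shift(frames, fold_div=4)
```

The operation only moves existing values between neighbouring frames, so it adds no learnable weights; any subsequent per-frame layer then sees a mix of past, present, and future features.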
Saraswat et al. (Thu,) studied this question.