Facial emotion recognition (FER) is a challenging domain in computer vision, with applications including human-computer interaction, healthcare, education, and security. This comparative study systematically reviews the performance of pretrained deep learning models for FER, covering both CNN architectures and Vision Transformer (ViT) models. We consolidate and correlate results from recent publications to assess model performance along several dimensions: classification accuracy, computational complexity, data requirements, and robustness to practical conditions. We examine well-known CNN architectures (VGGNet, ResNet, Inception, EfficientNet, and MobileNet), transformer methods (ViT and DeiT), and hybrid CNN-Transformer models. The collected results indicate that CNN-based models such as VGGNet and ResNet achieve a competitive baseline of 70–74% accuracy on FER2013, while modern architectures such as EfficientNet and hybrid transformer models achieve 85–90% accuracy on benchmark datasets. This review provides researchers and practitioners with a general framework for choosing suitable pretrained models given the particular conditions and constraints of their application scenarios.
Galina Kim (Tue,) studied this question.