December 4, 2025Open Access

ArecaNet: Robust Facial Emotion Recognition via Assembled Residual Enhanced Cross-Attention Networks for Emotion-Aware Human–Computer Interaction

Key Points

ArecaNet achieves improved facial emotion recognition performance, enhancing the accuracy of human–computer interaction systems.
The framework integrates multiple feature streams while minimizing information loss and optimizing feature extraction at 97.8% accuracy.
Convolutional neural networks are employed within an innovative ensemble approach that incorporates pre-trained sub-networks for robust emotion classification.
This methodology highlights the importance of feature integration, underscoring potential advancements in emotion-aware technology applications.

Abstract

Recently, the convergence of advanced sensor technologies and innovations in artificial intelligence and robotics has highlighted facial emotion recognition (FER) as an essential component of human–computer interaction (HCI). Traditional FER studies based on handcrafted features and shallow machine learning have shown a limited performance, while convolutional neural networks (CNNs) have improved nonlinear emotion pattern analysis but have been constrained by local feature extraction. Vision transformers (ViTs) have addressed this by leveraging global correlations, yet both CNN- and ViT-based single networks often suffer from overfitting, single-network dependency, and information loss in ensemble operations. To overcome these limitations, we propose ArecaNet, an assembled residual enhanced cross-attention network that integrates multiple feature streams without information loss. The framework comprises (i) channel and spatial feature extraction via SCSESResNet, (ii) landmark feature extraction from specialized sub-networks, (iii) iterative fusion through residual enhanced cross-attention, (iv) final emotion classification from the fused representation. Our research introduces a novel approach by integrating pre-trained sub-networks specialized in facial recognition with an attention mechanism and our uniquely designed main network, which is optimized for size reduction and efficient feature extraction. The extracted features are fused through an iterative residual enhanced cross-attention mechanism, which minimizes information loss and preserves complementary representations across networks. This strategy overcomes the limitations of conventional ensemble methods, enabling seamless feature integration and robust recognition. The experimental results show that the proposed ArecaNet achieved accuracies of 97.0% and 97.8% using the public databases, FER-2013 and RAF-DB, which were 4.5% better than the existing state-of-the-art method, PAtt-Lite, for FER-2013 and 2.75% for RAF-DB, and achieved a new state-of-the-art accuracy for each database.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Jae Myung Kim

Pohang University of Science and Technology

Gyu Ho Choi

Chosun University

Journals

Sensors

Actions

Institutions

Chosun University

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

ArecaNet: Robust Facial Emotion Recognition via Assembled Residual Enhanced Cross-Attention Networks for Emotion-Aware Human–Computer Interaction

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Journals

Actions

Institutions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study