Facial expression recognition (FER) is crucial for endowing service robots with emotional perception. High-performance FER depends on balancing the capture of subtle local textures against an understanding of the overall facial configuration. However, coordinating local feature variations with global semantic dependencies in unconstrained environments, while keeping the two semantically aligned, remains a challenge. To address this issue, we propose FER-SDAM, a network architecture built on a dual-attention hierarchical collaboration mechanism. The architecture introduces an Attention Consistency Loss (ACL) that explicitly aligns shallow structural awareness with deep global dependencies. It captures structural sensitivity and cross-regional correlations simultaneously, which facilitates the fusion of local structural information with global semantics and balances accuracy, robustness, and computational efficiency. We conducted extensive experiments on AffectNet, RAF-DB, and their subsets containing occlusion and pose variations, achieving accuracies of 68.12%, 66.68%, and 88.87% on AffectNet-7, AffectNet-8, and RAF-DB, respectively. The results show that FER-SDAM delivers highly competitive recognition performance while maintaining low computational overhead, making it well suited to real-time deployment on service robots.
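The abstract does not give the exact form of the Attention Consistency Loss, so the following is only an illustrative sketch of what aligning a shallow attention map with a deep one could look like. It assumes both maps are non-negative 2D grids, that the coarser deep map is resized to the shallow resolution by nearest-neighbour upsampling, and that disagreement is measured with a symmetric KL divergence. All function names, the resizing choice, and the divergence are assumptions made for illustration and are not taken from FER-SDAM.

```python
import math


def normalize(attn):
    """Flatten a 2D attention map and scale it so its entries sum to 1."""
    flat = [v for row in attn for v in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [v / total for v in flat]


def upsample_nearest(attn, out_h, out_w):
    """Nearest-neighbour resize of a 2D map to (out_h, out_w)."""
    in_h, in_w = len(attn), len(attn[0])
    return [
        [attn[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def attention_consistency_loss(shallow, deep, eps=1e-8):
    """Hypothetical ACL: symmetric KL between shallow and upsampled deep attention.

    This is an assumed formulation, not the loss defined in the paper.
    """
    h, w = len(shallow), len(shallow[0])
    p = normalize(shallow)
    q = normalize(upsample_nearest(deep, h, w))
    kl_pq = sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
    kl_qp = sum(b * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


# Toy maps: a 4x4 shallow map and two 2x2 deep maps.
shallow_map = [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
deep_aligned = [[1, 2], [3, 4]]      # same spatial pattern as shallow_map
deep_misaligned = [[4, 3], [2, 1]]   # reversed pattern

loss_aligned = attention_consistency_loss(shallow_map, deep_aligned)
loss_misaligned = attention_consistency_loss(shallow_map, deep_misaligned)
```

Under these assumptions the loss vanishes when the two layers attend to the same regions and grows as their attention patterns diverge, which is the behaviour an alignment term of this kind is meant to encourage. In a trainable model the same computation would be written with differentiable tensor operations and added to the classification loss with a weighting factor.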
Zhang et al. studied this question.