Facial emotion recognition (FER) systems can serve as a valuable non-invasive tool for assessing emotional states linked to mental health. However, two main issues hinder their adoption in clinical settings: the privacy concerns inherent to centralized data processing and the lack of transparent decision-making. This paper proposes a privacy-preserving and explainable FER framework that implements a collaborative distributed training approach for lightweight convolutional neural network (CNN) architectures across three heterogeneous datasets: RAF-DB, ExpW, and FER2013. SHapley Additive exPlanations (SHAP) guided the optimization of CNN filter configurations, prioritizing high accuracy, cross-dataset generalization, and interpretable, trustworthy explanations. A multidimensional explainability evaluation framework is developed that combines perturbation-based faithfulness and feature localization metrics into a Global Explanation Quality Score (GEQS) for quantitative assessment of explanation quality. A qualitative user study was conducted to assess the alignment between human perception and the quantitative explainability metrics. Guided by this explainability-driven evaluation, a lightweight CNN achieves 74.3% mean accuracy, highlighting its effectiveness for cross-dataset generalization. The results show that the quantitative GEQS and the human evaluation agree, with both identifying the same model architecture. However, the qualitative analysis shows that highlighting emotion-relevant facial features does not always ensure user trust. The practical viability of the proposed FER system in resource-constrained clinical environments is demonstrated through an implementation on a Raspberry Pi 4 with integrated SHAP explainability, achieving a 60 ms inference time.
Shehada et al. studied this question.
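To make the idea of a perturbation-based faithfulness metric concrete, the sketch below shows one common variant, a deletion test: features are replaced by a baseline value in order of decreasing attribution, and a faithful explanation should make the model output fall quickly. This is an illustrative assumption, not the paper's metric or its GEQS formula, which the abstract does not specify. The names `deletion_curve`, `mean_drop`, and the toy linear model are hypothetical.

```python
from typing import Callable, List, Sequence


def deletion_curve(
    predict: Callable[[Sequence[float]], float],
    x: Sequence[float],
    attributions: Sequence[float],
    baseline: float = 0.0,
) -> List[float]:
    """Model outputs as the most-attributed features are removed one by one.

    Hypothetical helper: "removing" a feature means overwriting it with
    `baseline`, which is itself an assumption about the perturbation scheme.
    """
    # Rank feature indices from highest to lowest attribution.
    order = sorted(range(len(x)), key=lambda i: attributions[i], reverse=True)
    perturbed = list(x)
    scores = [predict(perturbed)]  # score on the unperturbed input
    for i in order:
        perturbed[i] = baseline
        scores.append(predict(perturbed))
    return scores


def mean_drop(scores: Sequence[float]) -> float:
    """Average drop from the unperturbed score over all deletion steps.

    Larger values mean the explanation ranked influential features first.
    """
    original = scores[0]
    return sum(original - s for s in scores[1:]) / (len(scores) - 1)


# Toy stand-in for a classifier score: a fixed linear model (assumption).
weights = [0.5, 2.0, 0.0, 1.0]


def toy_predict(features: Sequence[float]) -> float:
    return sum(w * f for w, f in zip(weights, features))


x = [1.0, 1.0, 1.0, 1.0]
good_attr = [w * f for w, f in zip(weights, x)]  # ranks features by true effect
bad_attr = [-a for a in good_attr]               # deliberately reversed ranking

good = mean_drop(deletion_curve(toy_predict, x, good_attr))
bad = mean_drop(deletion_curve(toy_predict, x, bad_attr))
print(good, bad)
```

In this toy setting the attribution that ranks features by their true contribution produces a larger mean drop than the reversed ranking. A real evaluation would apply the same idea to image regions and SHAP values from the trained CNN.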