Multimodal emotion recognition is a promising approach to capturing the complexity of human affective states by integrating physiological and behavioral signals. However, feature redundancy, modality heterogeneity, and insufficient inter-modal supervision remain open challenges. In this paper, we propose a novel Multimodal Disentangled Knowledge Distillation framework that explicitly disentangles modality-shared and modality-specific features and enhances cross-modal knowledge transfer through a graph-based distillation module. Specifically, we introduce a dual-stream representation learning architecture that separates the common and unique subspaces across modalities. To facilitate effective information interaction, we design a directed, learnable modality graph in which each edge represents the semantic transfer strength from one modality to another. We validate our method on two benchmark datasets, MAHNOB-HCI and DEAP, for both regression and classification tasks under subject-dependent and subject-independent protocols. Experimental results show that our method achieves state-of-the-art performance, with statistical significance confirmed by paired two-tailed t-tests. In addition, qualitative analysis of the learned modality graph and t-SNE embeddings further illustrates the effectiveness of the feature disentanglement and dynamic knowledge transfer design. This work offers a unified, interpretable, and robust framework for multimodal emotion understanding and lays a foundation for affective computing in real-world human-machine interaction scenarios.
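To make the directed, learnable modality graph more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that edge strengths are obtained by a softmax over each source modality's outgoing edge scores (self-loops excluded) and that knowledge transfer along an edge is measured by a temperature-softened KL divergence between the two modalities' class predictions. The function names, the normalisation scheme, the choice of KL divergence, and the temperature value are all illustrative assumptions; the abstract does not specify them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def edge_weights(edge_logits):
    """Turn raw (learnable) edge scores into directed transfer strengths.

    edge_logits[i][j] scores the transfer from modality i to modality j.
    Assumption: each source's outgoing edges are softmax-normalised and
    self-loops are excluded, so every row sums to 1 with a zero diagonal.
    """
    n = len(edge_logits)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        targets = [j for j in range(n) if j != i]
        probs = softmax([edge_logits[i][j] for j in targets])
        for j, p in zip(targets, probs):
            weights[i][j] = p
    return weights


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def graph_distillation_loss(modality_logits, edge_logits, temperature=2.0):
    """Sum of edge-weighted KL terms over all directed modality pairs.

    modality_logits[k] holds the class logits predicted from modality k.
    For each edge i -> j, modality i acts as teacher and j as student.
    """
    weights = edge_weights(edge_logits)
    soft = [softmax([z / temperature for z in logits]) for logits in modality_logits]
    n = len(soft)
    loss = 0.0
    for i in range(n):        # source (teacher) modality
        for j in range(n):    # target (student) modality
            if i != j:
                loss += weights[i][j] * kl_divergence(soft[i], soft[j])
    return loss


if __name__ == "__main__":
    # Hypothetical example with three modalities and two emotion classes.
    logits = [[2.0, 0.0], [0.5, 1.5], [1.0, 1.0]]
    scores = [[0.0, 1.0, 2.0], [0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
    print(edge_weights(scores))
    print(graph_distillation_loss(logits, scores))
```

In a trainable model the edge scores would be parameters updated by gradient descent together with the encoders, so the graph can learn which modalities are the most useful teachers for the others. The fixed numbers here only illustrate the computation.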
Gao et al. (Wed,) studied this question.