As the impact of chronic mental disorders increases, multimodal sentiment analysis (MSA) has emerged to improve diagnosis and treatment. In this paper, our approach leverages disentangled representation learning to address modality heterogeneity with self-supervised learning as a guidance. The self-supervised learning is proposed to generate pseudo unimodal labels and guide modality-specific representation learning, preventing the acquisition of meaningless features. Additionally, we also propose a text-centric fusion to effectively mitigate the impacts of noise and redundant information and fuse the acquired disentangled representations into a comprehensive multimodal representation. We evaluate our model on three publicly available benchmark datasets for multimodal sentiment analysis and a privately collected dataset focusing on schizophrenia counseling. The experimental results demonstrate state-of-the-art performance across various metrics on the benchmark datasets, surpassing related works. Furthermore, our learning algorithm shows promising performance in real-world applications, outperforming our previous work and achieving significant progress in schizophrenia assessment.
Chang et al. (Wed,) studied this question.