Abstract

Contextual sentiment recognition is critical for applications such as intelligent customer service and mental health monitoring. However, existing models struggle with multimodal heterogeneity, knowledge scarcity, and inadequate capture of dynamic emotional transitions. To address these challenges, we propose a dual-branch neural encoding–decoding architecture with dynamic knowledge guidance. The model processes multimodal features (text, speech, video) and contextual dependencies in separate branches, incorporating both explicit knowledge (personality traits, domain rules) and implicit knowledge distilled from large language models. A dynamic context window adapts to emotional shifts to enhance real-time perception. Experiments on the IEMOCAP, MELD, and DailyDialog datasets show that our full model achieves accuracies of 82.1%, 78.3%, and 76.2%, respectively, surpassing state-of-the-art baselines, including fine-tuned GPT-4. The lightweight version (18.2M parameters) maintains a high inference speed (950 samples/sec) while reducing deployment costs. The model also exhibits strong cross-dataset generalization. This work provides an efficient framework that addresses these core challenges in contextual sentiment recognition, balancing performance with practicality for real-world deployment.
Xiangyu Cheng studied this question.