Multimodal Emotion Recognition in Conversations (ERC) is hampered by contextual sparsity and class imbalance: rare emotions are often diluted by frequent neutral or common emotional expressions. To address this, we propose a data-centric framework that enhances the representation of underrepresented emotions via context-aware augmentation. Our approach uses a fine-tuned 7B Small Language Model to generate emotion-induced conversation summaries. These summaries are then used for soft context injection during augmentation, guiding the generation of utterance paraphrases and the corresponding expressive speech via neural speech synthesis. Augmentation is thus applied only to the dialogue turns that belong to rare emotions rather than to the entire conversation. A multimodal autoencoder-based fusion model is then trained on text, summary, and speech embeddings to identify emotions in conversations. Experiments on benchmark datasets (MELD, EmoryNLP, and IEMOCAP) demonstrate that our method achieves significant improvements in detecting rare emotion classes (F1 score > 35%), outperforming existing baselines without degrading overall accuracy. The results show the effectiveness of generative augmentation and soft prompting for building context-aware solutions in affective computing.

• Proposed a data-centric framework for emotion recognition in conversations.
• Addressed context sparsity using LLM-guided soft prompting and speaker cues.
• Fine-tuned LLMs to generate emotion-rich, context-aware summaries of utterances.
• Enhanced rare emotion classes via LLM-based paraphrasing and expressive speech.
• Achieved notable F1 gains, especially for underrepresented emotion classes.
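The idea of augmenting only the dialogue turns that belong to rare emotions, rather than whole conversations, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary layout of a turn, the toy data, and the 10% frequency threshold used to call a class "rare" are all assumptions made for illustration. The paraphrasing and speech synthesis steps are omitted; the sketch only shows how candidate turns might be selected.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Turn = Dict[str, str]      # hypothetical layout: {"text": ..., "emotion": ...}
Dialogue = List[Turn]


def rare_emotions(dialogues: List[Dialogue], max_share: float = 0.10) -> Set[str]:
    """Return emotion labels whose share of all turns is below max_share.

    The 10% default is an illustrative assumption, not a value from the paper.
    """
    counts = Counter(turn["emotion"] for d in dialogues for turn in d)
    total = sum(counts.values())
    return {emotion for emotion, n in counts.items() if n / total < max_share}


def select_turns(dialogues: List[Dialogue], rare: Set[str]) -> List[Tuple[int, int]]:
    """Return (dialogue index, turn index) pairs for turns labeled with a rare emotion."""
    return [
        (d_idx, t_idx)
        for d_idx, dialogue in enumerate(dialogues)
        for t_idx, turn in enumerate(dialogue)
        if turn["emotion"] in rare
    ]


# Toy data (invented for this example): 12 turns, mostly neutral.
dialogues: List[Dialogue] = [
    [
        {"text": "Hi.", "emotion": "neutral"},
        {"text": "Hello.", "emotion": "neutral"},
        {"text": "Great to see you!", "emotion": "joy"},
        {"text": "Did you hear that noise?", "emotion": "fear"},
    ],
    [
        {"text": "Okay.", "emotion": "neutral"},
        {"text": "Sure.", "emotion": "neutral"},
        {"text": "Fine.", "emotion": "neutral"},
        {"text": "That is wonderful!", "emotion": "joy"},
        {"text": "Right.", "emotion": "neutral"},
        {"text": "That smells awful.", "emotion": "disgust"},
        {"text": "Yes.", "emotion": "neutral"},
        {"text": "No.", "emotion": "neutral"},
    ],
]

rare = rare_emotions(dialogues)
selected = select_turns(dialogues, rare)
```

In this toy setting, only the selected turns would be passed on to paraphrasing and speech synthesis, while the surrounding conversation would serve as context, for instance through a generated summary.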