What question did this study set out to answer?

The goal is to improve multimodal emotion recognition in conversations while addressing deployment challenges on edge devices.

March 25, 2026 · Open Access

Dual-Distillation Vision-Language Model for Multimodal Emotion Recognition in Conversation with Quantized Edge Deployment

Key Points

Proposed a dual-distillation model (DDVLM) using knowledge distillation techniques.
Employed weight-only quantization for efficient edge model deployment.
Used exponential moving average for self-distillation to enhance text feature stability.
Achieved state-of-the-art performance on the MELD dataset.
Enabled real-time inference on resource-constrained edge devices.
Reduced model size and inference latency while maintaining accuracy.
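The weight-only quantization idea in the points above can be illustrated with a minimal sketch. It assumes symmetric per-output-row INT8 scaling and uses plain Python lists for clarity; the helper names (`quantize_row_int8`, `quantize_linear_weights`, `linear_woq`) are hypothetical, and the paper's actual pipeline may differ in granularity, calibration, and kernels.

```python
from typing import List, Tuple


def quantize_row_int8(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization of one weight row: w ~= scale * q, q in [-127, 127]."""
    max_abs = max((abs(w) for w in row), default=0.0)
    # Assumption: an all-zero row gets scale 1.0 so that division stays defined.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def quantize_linear_weights(
    weight: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Quantize each output row of a linear layer's weight matrix separately."""
    pairs = [quantize_row_int8(row) for row in weight]
    return [q for q, _ in pairs], [s for _, s in pairs]


def linear_woq(
    x: List[float],
    q_weight: List[List[int]],
    scales: List[float],
    bias: List[float],
) -> List[float]:
    """Weight-only linear layer: activations and bias stay in floating point,
    only the stored weights are INT8 and are rescaled after accumulation."""
    out = []
    for q_row, scale, b in zip(q_weight, scales, bias):
        acc = sum(xi * qi for xi, qi in zip(x, q_row))
        out.append(acc * scale + b)
    return out
```

With this scheme each stored weight differs from its original value by at most half a quantization step (`scale / 2`), which is one reason weight-only INT8 tends to cost little accuracy when the remaining operations keep higher precision.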

Abstract

Multimodal Emotion Recognition in Conversation (ERC) has attracted attention as a key technology in human–computer interaction, mental healthcare, and intelligent services. However, deploying ERC in real-world settings remains challenging due to reliability gaps across modalities, instability in visual representations, and the high computational cost of large pretrained models. In particular, on resource-constrained edge devices, it is difficult to reduce model size and inference latency while preserving accuracy. To address these challenges, we propose DDVLM, a knowledge-distillation-based multimodal ERC model, together with an edge-optimized Weight-Only Quantization (WOQ) pipeline for efficient deployment. DDVLM assigns the textual modality as the teacher and the visual modality as the student, transferring emotion-distribution knowledge to improve non-verbal representations and stabilize multimodal learning. In addition, Exponential Moving Average (EMA)-based self-distillation enhances the consistency and generalization capability of text features. Meanwhile, the proposed WOQ pipeline quantizes linear-layer weights to INT8 while preserving precision-sensitive operations in mixed precision, thereby minimizing accuracy loss and reducing model size, memory usage, and inference latency. Experiments on the MELD dataset show that the proposed approach achieves state-of-the-art performance while also enabling real-time inference on edge devices such as NVIDIA Jetson. Overall, this work presents a practical ERC framework that jointly considers accuracy and deployability.
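The two distillation mechanisms described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the loss form, temperature, or decay rate, so the temperature-scaled KL divergence (with the commonly used T² factor), the default values, and the function names here are all hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over emotion logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,  # assumed value, not taken from the paper
) -> float:
    """Push the student (visual) emotion distribution toward the teacher (text) one."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def ema_update(
    ema_params: List[float],
    params: List[float],
    decay: float = 0.999,  # assumed value, not taken from the paper
) -> List[float]:
    """One EMA step: the slowly moving copy serves as the self-distillation target."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

In a training loop of this kind, `distillation_loss` would typically be added to the ordinary classification loss for the visual branch, while `ema_update` would be applied to the text encoder's parameters after each optimizer step.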

Cite This Study

Kim et al. studied this question.

synapsesocial.com/papers/69c37c33b34aaaeb1a67ef2b https://doi.org/10.3390/app16063103