What question did this study set out to answer?

The study aims to improve domain-specific multimodal neural machine translation under visual imbalance, where a sentence may correspond to several images or to none at all.

March 1, 2026

Virtual Domain-Guided Cross-Modal Distillation With Multiview Correlation Awareness for Domain-Specific Multimodal Neural Machine Translation

Key Points

Targeted visual imbalance issues to improve domain-specific multimodal neural machine translation performance.
Adopted a multiview correlation-aware cross-modal distillation strategy.
Generated virtual domain visual scenes through multikernel representations of images.
Integrated pseudo-domain visual scenes with text for enhanced translation results.
Achieved state-of-the-art machine translation scores on most test sets of the benchmark datasets.
Demonstrated increased robustness in domain machine translation across different scenarios.

Abstract

Domain-specific multimodal neural machine translation (DMNMT) aims to translate domain sentences from a source language into a target language by incorporating images as additional contextual information. However, domain-specific multimodal scenarios frequently suffer from visual imbalance issues, such as one sentence corresponding to multiple images or to no images at all. Effectively integrating visual information into text to improve domain sentence translation under visual imbalance, especially for domain-related terms, is one of the critical challenges for DMNMT. To tackle these domain-specific visual imbalance problems, this article introduces a virtual domain distillation-enhanced multimodal fusion method with awareness of multiview correlations to enhance the robustness and performance of domain machine translation across various multimodal domain scenarios. We first adopt a multiview correlation-aware cross-modal distillation strategy to generate virtual domain visual scenes by extracting visual correlations among all images through multikernel representations. Subsequently, we integrate the pseudo-domain visual scenes into text to improve the performance of domain-specific machine translation. Our proposed approach can capture domain visual representations across different scenarios, contributing to more effective domain-specific translation. We conduct extensive experiments on three domain-specific and general-domain benchmark datasets. Experimental results demonstrate that our proposed approach achieves state-of-the-art (SOTA) machine translation scores on most test sets. In-depth analysis demonstrates the effectiveness and robustness of our proposed approach for domain machine translation.
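
To make the idea of multikernel correlations among images more concrete, the following is a minimal illustrative sketch and not the authors' method. It assumes each image is already encoded as a feature vector, uses Gaussian (RBF) kernels at a few arbitrarily chosen bandwidths as a stand-in for the "multikernel representations", and pools a variable number of images into one fixed-size vector. The function names (`rbf_kernel`, `multikernel_matrix`, `pool_virtual_scene`) and the bandwidth values are hypothetical. The sketch covers only the case of several images per sentence; the no-image case and the distillation step described in the abstract are not modeled.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def multikernel_matrix(features, gammas):
    """Average several RBF kernels (one per bandwidth) into one correlation matrix."""
    n = len(features)
    return [
        [
            sum(rbf_kernel(features[i], features[j], g) for g in gammas) / len(gammas)
            for j in range(n)
        ]
        for i in range(n)
    ]


def pool_virtual_scene(features, gammas=(0.1, 1.0, 10.0)):
    """Pool a variable number of image vectors into one fixed-size vector.

    Each image is weighted by its mean correlation with every image in the
    set, so images that agree with the rest of the set contribute more.
    The default bandwidths are arbitrary illustrative values.
    """
    if not features:
        raise ValueError("at least one image feature vector is required")
    kernel = multikernel_matrix(features, gammas)
    scores = [sum(row) / len(row) for row in kernel]
    total = sum(scores)
    weights = [s / total for s in scores]
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


# Toy input: two similar images and one outlier attached to the same sentence.
image_features = [[1.0, 0.0], [0.9, 0.1], [-1.0, 2.0]]
scene, weights = pool_virtual_scene(image_features)
```

In this toy input the third image is far from the other two, so it receives the smallest weight and the pooled vector stays close to the two images that agree with each other. A real system would presumably learn such weights jointly with the translation model rather than fix them by hand.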
