Domain-specific multimodal neural machine translation (DMNMT) aims to translate domain sentences from a source language into a target language by incorporating images as additional contextual information. However, domain-specific multimodal scenarios frequently suffer from visual imbalance issues, such as one sentence corresponding to multiple images or to no images at all. Effectively integrating visual information into text to improve domain sentence translation under visual imbalance, especially for domain-related terms, is one of the critical challenges for DMNMT. To tackle these domain-specific visual imbalance problems, this article introduces a virtual domain distillation-enhanced multimodal fusion method with awareness of multiview correlations to enhance the robustness and performance of domain machine translation across various multimodal domain scenarios. We first adopt a multiview correlation-aware cross-modal distillation strategy to generate virtual domain visual scenes by extracting visual correlations among all images through multikernel representations. Subsequently, we integrate the generated pseudo-domain visual scenes into text to improve the performance of domain-specific machine translation. Our proposed approach can capture domain visual representations across different scenarios, contributing to more effective domain-specific translation. We conduct extensive experiments on three domain-specific and general-domain benchmark datasets. Experimental results demonstrate that our proposed approach achieves state-of-the-art (SOTA) machine translation scores on most test sets. In-depth analysis further demonstrates the effectiveness and robustness of our proposed approach for domain machine translation.
Hou et al. (Thu,) studied this question.