January 1, 2023Open Access

D2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization

Key Points

Key points are not available for this paper at this time.

Abstract

Many-to-many multimodal summarization (M3S) task aims to generate summaries in any language with document inputs in any language and the corresponding image sequence, which essentially comprises of multimodal monolingual summarization (MMS) and multimodal cross-lingual summarization (MXLS) tasks. Although much work has been devoted to either MMS or MXLS, little research pays attention to the M3S task. Besides, existing studies mainly focus on 1) utilizing MMS to enhance MXLS via knowledge distillation without considering the performance of MMS or 2) improving MMS models by filtering summary-unrelated visual features with implicit learning or explicitly complex training objectives. In this paper, we first introduce a general and practical task, i.e., M3S. Further, we propose a dual knowledge distillation and target-oriented vision modeling framework for the M3S task. Specifically, the dual knowledge distillation method guarantees that the knowledge of MMS and MXLS can be transferred to each other and thus mutually prompt both of them. To offer target-oriented visual features, a simple yet effective target-oriented contrastive objective is designed and responsible for discarding needless visual information. Extensive experiments on the many-to-many setting show the effectiveness of the proposed approach. Additionally, we contribute a many-to-many multimodal summarization (lmttM3Sum) dataset with 44 languages to facilitate future research.

Bookmark

View Full Paper

Cite This Study

Liang et al. (Sun,) studied this question.

synapsesocial.com/papers/6a16fe640f965e9c137bd6e2 https://doi.org/https://doi.org/10.18653/v1/2023.findings-emnlp.994

Bookmark

View Full Paper