January 1, 2017 | Open Access

Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video

Abstract

The rapid increase in multimedia data transmitted over the Internet calls for multi-modal summarization (MMS) of collections of text, images, audio and video. In this work, we propose an extractive multi-modal summarization method that automatically generates a textual summary from a set of documents, images, audio recordings and videos related to a specific topic. The key idea is to bridge the semantic gaps between multi-modal content. For audio information, we design an approach that selectively uses its transcription. For visual information, we learn joint representations of text and images using a neural network. Finally, all of the multi-modal aspects are considered to generate the textual summary by maximizing salience, non-redundancy, readability and coverage through budgeted optimization of submodular functions. We further introduce an MMS corpus in English and Chinese, which is released to the public. Experimental results on this dataset demonstrate that our method outperforms other competitive baseline methods.
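
To make the idea of budgeted submodular optimization concrete, here is a minimal Python sketch of a cost-scaled greedy selector. It is not the paper's objective: it assumes a simple weighted word-coverage function as a stand-in for the salience, non-redundancy, readability and coverage terms, and the names `coverage`, `greedy_budgeted_summary`, the scaling exponent `r`, and the toy weights are all hypothetical. The budget is assumed to be a word count.

```python
from typing import Dict, List


def coverage(selected: List[int],
             sentences: List[List[str]],
             weights: Dict[str, float]) -> float:
    """Weighted word coverage: a monotone submodular stand-in objective."""
    covered = set()
    for i in selected:
        covered.update(sentences[i])
    return sum(weights.get(w, 0.0) for w in covered)


def greedy_budgeted_summary(sentences: List[List[str]],
                            weights: Dict[str, float],
                            budget: int,
                            r: float = 1.0) -> List[int]:
    """Pick sentence indices greedily by marginal gain per unit cost.

    Cost is the sentence length in words (an assumption); a sentence is
    only considered if it still fits within the remaining budget.
    """
    selected: List[int] = []
    used = 0
    remaining = set(range(len(sentences)))
    while remaining:
        base = coverage(selected, sentences, weights)
        best, best_ratio = None, 0.0
        for i in sorted(remaining):
            cost = len(sentences[i])
            if cost == 0 or used + cost > budget:
                continue
            gain = coverage(selected + [i], sentences, weights) - base
            ratio = gain / (cost ** r)
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:  # nothing fits, or nothing adds value
            break
        selected.append(best)
        used += len(sentences[best])
        remaining.discard(best)
    return selected


# Toy example with made-up sentences and weights.
sentences = [
    ["flood", "hits", "city"],
    ["flood", "city"],
    ["rescue", "teams", "arrive"],
]
weights = {"flood": 2.0, "city": 1.0, "hits": 0.5,
           "rescue": 2.0, "teams": 1.0, "arrive": 0.5}
summary = greedy_budgeted_summary(sentences, weights, budget=5)
print(summary)  # [1, 2]
```

Because coverage counts each word only once, a sentence that repeats already-selected content has little marginal gain, which is how a submodular objective discourages redundancy under a length budget.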
