February 15, 2020 | Open Access

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Abstract

With the recent success of pre-training techniques for NLP and image-linguistic tasks, several video-linguistic pre-training works have gradually been developed to improve video-text downstream tasks. However, most existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components: two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone. Five objectives, namely video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction, are designed to train each of the components. We further develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of UniVL more effective. Pre-training is carried out on HowTo100M, a sizeable instructional video dataset. Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
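To make the conditioned masked language model (CMLM) objective more concrete, the following is a minimal sketch of how a training example for such an objective might be prepared: some text tokens are hidden, and the model would be asked to recover them given the remaining text and the paired video features. The abstract does not specify the masking procedure, so the function names, the `[MASK]` token, and the 15% masking rate are all illustrative assumptions, not the authors' implementation.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide each token independently with probability mask_prob.

    Returns the masked sequence and a dict mapping each masked
    position to its original token (the prediction targets).
    The 15% default is an assumed value for illustration only.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


def build_cmlm_example(tokens, video_features, mask_prob=0.15, seed=0):
    """Pair masked text with untouched video features.

    The video side is left intact so that a model could condition on it
    when predicting the hidden tokens.
    """
    masked, targets = mask_tokens(tokens, mask_prob, seed)
    return {"text": masked, "video": video_features, "targets": targets}


example = build_cmlm_example(
    ["add", "the", "onions", "to", "the", "pan"],
    video_features=[[0.1, 0.2], [0.3, 0.4]],  # toy per-clip feature vectors
)
```

A conditioned masked frame model (CMFM) example could be built symmetrically under the same assumptions, by hiding video feature vectors and leaving the text intact.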

Authors

Huaishao Luo

Jingdong (China)

Lei Ji

Huawei Technologies (China)

Botian Shi

Beijing Academy of Artificial Intelligence
