Key points are not available for this paper at this time.
With the recent success of the pre-training technique for NLP and image-linguistic tasks, some video-linguistic pre-training works are gradually developed to improve video-text related downstream tasks. However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks. This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation. It comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone. Five objectives, including video-text joint, conditioned masked language model (CMLM), conditioned masked frame model (CMFM), video-text alignment, and language reconstruction, are designed to train each of the components. We further develop two pre-training strategies, stage by stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of the UniVL more effective. The pre-train is carried out on a sizeable instructional video dataset HowTo100M. Experimental results demonstrate that the UniVL can learn strong video-text representation and achieves state-of-the-art results on five downstream tasks.
Building similarity graph...
Analyzing shared references across papers
Loading...
Huaishao Luo
Jingdong (China)
Lei Ji
Huawei Technologies (China)
Botian Shi
Beijing Academy of Artificial Intelligence
Building similarity graph...
Analyzing shared references across papers
Loading...
Luo et al. (Sat,) studied this question.
synapsesocial.com/papers/6a0f06e1b7cc3b883f22fe23 — DOI: https://doi.org/10.48550/arxiv.2002.06353