June 1, 2023

All in One: Exploring Unified Video-Language Pre-Training

Abstract

Mainstream Video-Language Pre-training (VLP) models [10, 26, 64] consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by using heavier unimodal encoders or multimodal fusion Transformers, which increases parameter counts and lowers efficiency in downstream tasks. In this work, we introduce, for the first time, an end-to-end VLP model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture. We argue that the unique temporal information of video data is a key barrier to designing a modality-agnostic Transformer. To overcome this challenge, we introduce a novel and effective token rolling operation that encodes temporal representations from video clips in a non-parametric manner. This careful design enables representation learning for both video-text multimodal inputs and unimodal inputs with a unified model. Our pretrained all-in-one Transformer is transferred to various downstream video-text tasks after fine-tuning, including text-video retrieval, video question answering, multiple choice, and video captioning. State-of-the-art performance with minimal model FLOPs on ten datasets demonstrates the superiority of our method over competitive counterparts. The code and pretrained models are available at https://github.com/showlab/all-in-one.
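To make the idea of a non-parametric token rolling operation concrete, here is a minimal sketch in plain Python. The abstract does not specify how the operation is implemented, so everything below is an assumption made for illustration: the function name `roll_tokens`, the parameters `num_rolled` and `shift`, and the choice to cyclically shift the first few tokens of each frame along the time axis are all hypothetical and may differ from the released code.

```python
from typing import List

Token = List[float]   # one D-dimensional token embedding
Frame = List[Token]   # N tokens belonging to one video frame


def roll_tokens(frames: List[Frame], num_rolled: int, shift: int = 1) -> List[Frame]:
    """Cyclically shift the first `num_rolled` tokens of every frame along time.

    Hypothetical illustration: after the call, frame i holds the first
    `num_rolled` tokens of frame (i - shift) mod T and keeps its own
    remaining tokens. No learnable parameters are involved.
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    rolled: List[Frame] = []
    for i, frame in enumerate(frames):
        source = frames[(i - shift) % num_frames]
        # Tokens borrowed from a neighbouring frame carry temporal context.
        borrowed = [list(tok) for tok in source[:num_rolled]]
        # The rest of the frame's own tokens stay in place.
        kept = [list(tok) for tok in frame[num_rolled:]]
        rolled.append(borrowed + kept)
    return rolled


# Three frames, two 1-dimensional tokens each; the value marks the frame index.
clip = [[[0.0], [0.0]], [[1.0], [1.0]], [[2.0], [2.0]]]
mixed = roll_tokens(clip, num_rolled=1)
```

In this sketch each frame ends up with a mix of its own tokens and tokens from the previous frame, so a subsequent per-frame self-attention layer can see information from neighbouring frames without any added weights.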

Authors

Jinpeng Wang

Nanjing Agricultural University

Yixiao Ge

Tencent (China)

Rui Yan

Nanjing University of Science and Technology

Institutions

Columbia University

University of Hong Kong

National University of Singapore
