A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEIT-3, which achieves excellent transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We use Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked “language” modeling on images (Imglish), texts (English), and image-text pairs (“parallel sentences”) in a unified manner. Experimental results show that BEIT-3 obtains remarkable performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
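The unified masking objective described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `[MASK]` placeholder, the `mask_tokens` helper, the toy token names, and the 0.4 mask ratio are all assumptions chosen for illustration. The point is only that one masking routine can be applied unchanged to image tokens, text tokens, and a concatenated image-text pair.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns (corrupted, targets), where targets maps each masked position
    to the original token a model would be trained to recover.
    """
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = corrupted[pos]
        corrupted[pos] = MASK
    return corrupted, targets


rng = random.Random(0)
image_tokens = [f"img_{i}" for i in range(8)]    # stand-in for discrete visual tokens
text_tokens = ["a", "dog", "chases", "a", "ball"]
pair_tokens = image_tokens + text_tokens          # image-text pair as one sequence

# The same routine handles all three input types (mask ratio is an assumption).
for name, seq in [("image", image_tokens), ("text", text_tokens), ("pair", pair_tokens)]:
    corrupted, targets = mask_tokens(seq, mask_ratio=0.4, rng=rng)
    print(name, corrupted, targets)
```

In a real system the targets for image positions would come from a visual tokenizer and the prediction would be made by the shared backbone; both are outside the scope of this sketch.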
Wenhui Wang
Hunan University of Arts and Science
Hangbo Bao
Microsoft Research (United Kingdom)
Dong Li
Tongji University
Microsoft (Finland)
Wang et al. studied this question.
synapsesocial.com/papers/69dff8c71827a1d0b1255afe — DOI: https://doi.org/10.1109/cvpr52729.2023.01838