July 29, 2024Open Access

Efficiently Gluing Pre-trained Language and Vision Models for Image Captioning

Key Points

Key points are not available for this paper at this time.

Abstract

Vision-and-language pre-training models have achieved impressive performance for image captioning. But most of them are trained with millions of paired image-text data and require huge memory and computing overhead. To alleviate this, we try to stand on the shoulders of large-scale pre-trained language models (PLM) and pre-trained vision models (PVM) and efficiently connect them for image captioning. There are two major challenges: one is that language and vision modalities have different semantic granularity (e.g., a noun may cover many pixels), and the other is that the semantic gap still exists between the pre-trained language and vision models. To this end, we design a lightweight and efficient connector to glue PVM and PLM, which holds a criterion of selection-then-transformation . Specifically, in the selection phase, we treat each image as a set of patches instead of pixels. We select salient image patches and cluster them into visual regions to align with text. Then, to effectively reduce the semantic gap, we propose to map the selected image patches into text space through spatial and channel transformations. With training on image captioning datasets, the connector learns to bridge the semantic granularity and semantic gap via backpropagation, preparing for the PLM to generate descriptions. Experimental results on the MSCOCO and Flickr30k datasets demonstrate that our method yields comparable performance to existing works. By solely training the small connector, we achieve a CIDEr performance of 132.2% on the MSCOCO Karpathy test split. Moreover, our findings reveal that fine-tuning the PLM can further enhance performance potential, resulting in a CIDEr score of 140.6%. Code and models are available at https://github.com/YuanEZhou/PrefixCap .

Read Full Paperexternally

KI fragen

Bookmark

View Full Paper