April 2, 2024 | Open Access

VLRM: Vision-Language Models act as Reward Models for Image Captioning

Abstract

In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning, with vision-language models such as CLIP and BLIP2-ITM serving as reward models. The RL-tuned model generates longer and more comprehensive descriptions. Our model reaches an impressive 0.90 R@1 CLIP Recall score on the MS-COCO Karpathy test split. Weights are available at https://huggingface.co/sashakunitsyn/vlrm-blip2-opt-2.7b.
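
To make the reported metric concrete, the following is a minimal sketch of how an R@1 retrieval score can be computed from caption and image embeddings. It assumes that CLIP Recall measures how often a generated caption retrieves its own image as the top match under cosine similarity; the abstract does not spell out the definition, so treat this as an illustration rather than the authors' evaluation code. The function names `cosine` and `recall_at_k` are hypothetical, and the inputs are plain lists of floats standing in for real CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(caption_embs, image_embs, k=1):
    """Fraction of captions whose own image ranks in the top k retrieved images.

    Assumption: caption_embs[i] and image_embs[i] describe the same image.
    """
    hits = 0
    for i, cap in enumerate(caption_embs):
        sims = [cosine(cap, img) for img in image_embs]
        # Rank = number of images scoring strictly higher than the correct one.
        rank = sum(1 for s in sims if s > sims[i])
        if rank < k:
            hits += 1
    return hits / len(caption_embs)


# Toy data: three images and three captions. The third caption is closest
# to the first image, so it misses at k=1 but is recovered at k=2.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
captions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
print(recall_at_k(captions, images, k=1))
print(recall_at_k(captions, images, k=2))
```

Under this reading, a score of 0.90 would mean that for 90% of test images the generated caption ranks its own image first among all candidates, which is why longer, more specific captions tend to score higher.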

Cite This Study

Dzabraev et al. (2024). VLRM: Vision-Language Models act as Reward Models for Image Captioning.

synapsesocial.com/papers/68e70c60b6db6435876866d9 https://doi.org/10.48550/arxiv.2404.01911
