March 5, 2024

FashionVLM - Fashion Captioning Using Pretrained Vision Transformer and Large Language Model

Abstract

Image captioning models automatically generate image descriptions from the semantics of the input image. Advances in image captioning have paved the way for fashion image captioning, which aims to generate more expressive descriptions that capture more attributes of a fashion item. In this work, we design and develop a fashion image captioning model that automates the generation of descriptive captions for fashion items. We call it the Fashion Vision-Language Model (FashionVLM) to reflect its multimodal nature. We use a frozen large language model as the text decoder and a vision transformer as the image encoder, and connect the two with a comparatively small Querying Transformer. We fine-tune on the Fashion Captioning Dataset (FACAD), one of the largest datasets of fashion items, starting from BLIP-2 pretraining stage-two models and MS COCO fine-tuned models, in three different stages. In Stage One, the base models are the OPT-2.7 and OPT-6.7 based BLIP-2 pretraining stage-two models. In Stage Two, the base models are the BLIP-2 OPT-2.7 and OPT-6.7 based MS COCO fine-tuned models. In Stage Three, the base models are the Stage One models, which are fine-tuned further. The OPT-6.7 based Stage Three FashionVLM outperforms the previous state of the art for fashion captioning on FACAD, improving BLEU-4, CIDEr, ROUGE-L, and METEOR by +4.281, +39.015, +5.667, and +3.519 points, respectively.
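
The three-stage setup can be read as a small dependency plan over base checkpoints. The sketch below is only an illustration of that reading, not the authors' training code: the `Stage` class, the `build_plan` and `lineage` helpers, and the checkpoint labels are hypothetical placeholders, and it assumes every stage fine-tunes on FACAD, which the abstract implies but does not spell out for Stage Three.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Stage:
    """One fine-tuning stage (hypothetical structure, for illustration only)."""
    name: str
    base_checkpoint: Optional[str]  # placeholder label for an external checkpoint
    base_stage: Optional[str]       # name of an earlier stage to start from
    dataset: str = "FACAD"          # assumption: all stages fine-tune on FACAD


def build_plan(llm: str) -> Dict[str, Stage]:
    """Build the three-stage plan for one language model size, e.g. 'opt-6.7'."""
    return {
        # Stage One starts from the BLIP-2 pretraining stage-two model.
        "stage_one": Stage("stage_one", f"blip2-pretrain-stage2-{llm}", None),
        # Stage Two starts from the MS COCO fine-tuned BLIP-2 model.
        "stage_two": Stage("stage_two", f"blip2-coco-finetuned-{llm}", None),
        # Stage Three starts from the model produced by Stage One.
        "stage_three": Stage("stage_three", None, "stage_one"),
    }


def lineage(plan: Dict[str, Stage], stage_name: str) -> List[str]:
    """Return the chain from the external checkpoint to the requested stage."""
    chain: List[str] = []
    stage = plan[stage_name]
    while True:
        chain.append(f"{stage.name} (fine-tune on {stage.dataset})")
        if stage.base_stage is None:
            # Reached a stage that starts from an external checkpoint.
            chain.append(str(stage.base_checkpoint))
            break
        stage = plan[stage.base_stage]
    return list(reversed(chain))


if __name__ == "__main__":
    plan = build_plan("opt-6.7")
    for step in lineage(plan, "stage_three"):
        print(step)
```

Under these assumptions, tracing `stage_three` shows that the best-performing model is reached from the BLIP-2 pretraining stage-two checkpoint through Stage One, and never passes through the MS COCO fine-tuned checkpoint.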
