What does this research mean for the field?

Using a multimodal encoder (CLIP) to guide a generative model (VQGAN) enables zero-shot, open-domain image generation and editing that achieves higher visual quality than specially trained models like DALL-E and GLIDE. Novelty: methodological. Consensus alignment: neutral.

April 18, 2022 | Open Access

VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance

Abstract

Generating and editing images from open domain text prompts is a challenging task that heretofore has required expensive and specially trained models. We demonstrate a novel methodology for both tasks which is capable of producing images of high visual quality from text prompts of significant semantic complexity without any training by using a multimodal encoder to guide image generations. We demonstrate on a variety of tasks how using CLIP [37] to guide VQGAN [11] produces higher visual quality outputs than prior, less flexible approaches like DALL-E [38], GLIDE [33] and Open-Edit [24], despite not being trained for the tasks presented. Our code is available in a public repository.
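
The guidance idea in the abstract can be illustrated with a toy sketch: keep a generator and a multimodal encoder frozen, and adjust only the generator's latent input so that the encoded image moves closer to the encoded text prompt. Everything below is an assumption made for illustration. The small random linear maps stand in for VQGAN and CLIP, the function names are hypothetical, and plain gradient ascent with a step-halving safeguard is used for simplicity; it is not taken from the paper or its repository.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def similarity(decoder, encoder, latent, text_emb):
    """Cosine similarity between the encoded 'image' and the text embedding."""
    emb = matvec(encoder, matvec(decoder, latent))
    dot = sum(e * t for e, t in zip(emb, text_emb))
    return dot / (norm(emb) * norm(text_emb))


def latent_gradient(decoder, encoder, latent, text_emb):
    """Gradient of the cosine similarity with respect to the latent only."""
    emb = matvec(encoder, matvec(decoder, latent))
    dot = sum(e * t for e, t in zip(emb, text_emb))
    ne, nt = norm(emb), norm(text_emb)
    # Derivative of cos(emb, text_emb) with respect to emb.
    g_emb = [t / (ne * nt) - dot * e / (ne ** 3 * nt)
             for e, t in zip(emb, text_emb)]
    # Backpropagate through the two frozen linear maps via their transposes.
    g_img = matvec(transpose(encoder), g_emb)
    return matvec(transpose(decoder), g_img)


def guide_latent(decoder, encoder, text_emb, latent, steps=200, lr=0.5):
    """Update the latent to raise similarity; the model weights never change."""
    history = [similarity(decoder, encoder, latent, text_emb)]
    for _ in range(steps):
        grad = latent_gradient(decoder, encoder, latent, text_emb)
        candidate = [z + lr * g for z, g in zip(latent, grad)]
        score = similarity(decoder, encoder, candidate, text_emb)
        if score > history[-1]:
            latent = candidate          # accept the improving step
            history.append(score)
        else:
            lr *= 0.5                   # step too large: halve and retry
    return latent, history


def toy_demo(seed=0):
    """Build random stand-in weights and run the guided search once."""
    rng = random.Random(seed)
    rand = lambda rows, cols: [[rng.uniform(-1, 1) for _ in range(cols)]
                               for _ in range(rows)]
    decoder = rand(8, 4)    # stand-in generator: latent (4) -> image (8)
    encoder = rand(3, 8)    # stand-in image encoder: image (8) -> embedding (3)
    text_emb = [rng.uniform(-1, 1) for _ in range(3)]   # stand-in prompt embedding
    latent = [rng.uniform(-1, 1) for _ in range(4)]
    _, history = guide_latent(decoder, encoder, text_emb, latent)
    return history


if __name__ == "__main__":
    scores = toy_demo()
    print(f"similarity: {scores[0]:.3f} -> {scores[-1]:.3f}")
```

The point of the sketch is that only the latent is optimized: neither stand-in model receives a weight update, which mirrors the abstract's claim that the approach works "without any training".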
