July 12, 2024

Open Access

Separate-and-Enhance: Compositional Finetuning for Text-to-Image Diffusion Models

Key Points

A diffusion model finetuned with the proposed framework shows improved adaptability and compositional generation for multi-object prompts, enhancing text-image alignment.
Our findings reveal that addressing low attention activation and mask overlaps is key to better compositional performance.
Evaluations show that the compositional finetuning framework, with its Separate and Enhance losses, outperforms baseline models in image realism and text-image alignment.

Abstract

Despite significant recent strides by diffusion-based Text-to-Image (T2I) models, current systems still struggle to produce compositional generations that align with text prompts, particularly for multi-object generation. In this work, we first show the fundamental reasons for such misalignment by identifying two issues: low attention activation and mask overlaps. We then propose a compositional finetuning framework with two novel objectives, the Separate loss and the Enhance loss, which reduce object mask overlaps and maximize attention scores, respectively. Unlike conventional test-time adaptation methods, our model, once finetuned on critical parameters, can directly perform inference on an arbitrary multi-object prompt, which improves scalability and generalizability. Through comprehensive evaluations, our model demonstrates superior performance in image realism, text-image alignment, and adaptability, significantly surpassing established baselines. Furthermore, we show that training our model on a diverse range of concepts enables it to generalize effectively to novel concepts, outperforming models trained on individual concept pairs.
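To make the two objectives concrete, the following is a minimal sketch of what an overlap penalty and an activation reward could look like. It is not the paper's formulation: the function names, the use of the peak value as the "attention score", the min-based overlap measure, and the weights `w_sep` and `w_enh` are all assumptions chosen for illustration. Each object token is assumed to have one flattened cross-attention map with values in [0, 1].

```python
from itertools import combinations


def enhance_loss(attn_maps):
    """Hypothetical Enhance-style term: penalize low peak activation.

    attn_maps: list of flattened attention maps (one per object token),
    each a list of floats in [0, 1]. The loss is 0 when every object has
    at least one location with activation 1.
    """
    return sum(1.0 - max(m) for m in attn_maps) / len(attn_maps)


def separate_loss(attn_maps, eps=1e-8):
    """Hypothetical Separate-style term: penalize overlap between objects.

    Each map is normalized to sum to (approximately) 1, then the overlap
    of a pair is the sum of element-wise minima. Disjoint maps give 0,
    identical maps give a value close to 1.
    """
    normed = [[v / (sum(m) + eps) for v in m] for m in attn_maps]
    pairs = list(combinations(normed, 2))
    if not pairs:
        return 0.0  # a single object cannot overlap with anything
    overlap = sum(sum(min(a, b) for a, b in zip(p, q)) for p, q in pairs)
    return overlap / len(pairs)


def compositional_loss(attn_maps, w_sep=1.0, w_enh=1.0):
    """Weighted sum of the two illustrative terms (weights are assumptions)."""
    return w_sep * separate_loss(attn_maps) + w_enh * enhance_loss(attn_maps)
```

In an actual finetuning setup, the maps would be tensors taken from the model's cross-attention layers and the loss would be backpropagated into the selected parameters; plain Python lists are used here only to keep the arithmetic visible.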
