Despite recent significant strides in diffusion-based Text-to-Image (T2I) models, current systems still struggle to produce compositional generations that align with text prompts, particularly in multi-object generation. In this work, we first trace this misalignment to two underlying issues: low attention activation and mask overlaps. We then propose a compositional finetuning framework with two novel objectives, the Separate loss and the Enhance loss, which reduce object mask overlaps and maximize attention scores, respectively. Unlike conventional test-time adaptation methods, our model, once finetuned on critical parameters, can directly perform inference on an arbitrary multi-object prompt, which improves scalability and generalizability. In comprehensive evaluations, our model demonstrates superior performance in image realism, text-image alignment, and adaptability, significantly surpassing established baselines. Furthermore, we show that training our model on a diverse range of concepts enables it to generalize effectively to novel concepts, with better performance than models trained on individual concept pairs.
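To make the two objectives concrete, here is a minimal sketch of what an overlap penalty and an activation reward could look like on a pair of per-object attention maps. It is an illustration under stated assumptions, not the formulation used in the paper: the abstract does not give the loss definitions, so the function names (`separate_loss`, `enhance_loss`, `compositional_loss`), the overlap measure, the weights, and the assumption that attention values lie in [0, 1] are all hypothetical.

```python
from typing import List

# Hypothetical representation: an H x W cross-attention map for one object
# token, with every value assumed to lie in [0, 1].
AttentionMap = List[List[float]]


def flatten(attn: AttentionMap) -> List[float]:
    """Return the map's values as a flat list, row by row."""
    return [value for row in attn for value in row]


def separate_loss(attn_a: AttentionMap, attn_b: AttentionMap) -> float:
    """Assumed overlap penalty: shared attention mass divided by the smaller
    of the two total masses. 0.0 for disjoint maps, 1.0 when one map is
    fully covered by the other."""
    a, b = flatten(attn_a), flatten(attn_b)
    overlap = sum(min(x, y) for x, y in zip(a, b))
    denom = min(sum(a), sum(b))
    return overlap / denom if denom > 0 else 0.0


def enhance_loss(attn: AttentionMap) -> float:
    """Assumed activation penalty: one minus the peak attention value, so a
    weakly attended object yields a larger loss."""
    return 1.0 - max(flatten(attn))


def compositional_loss(attn_a: AttentionMap, attn_b: AttentionMap,
                       w_sep: float = 1.0, w_enh: float = 1.0) -> float:
    """Weighted sum of both terms; the weights are illustrative only."""
    enhance = 0.5 * (enhance_loss(attn_a) + enhance_loss(attn_b))
    return w_sep * separate_loss(attn_a, attn_b) + w_enh * enhance


# Two well-separated, strongly activated objects versus two weak objects
# that attend to the same region.
cat = [[0.9, 0.0], [0.0, 0.0]]
dog = [[0.0, 0.0], [0.0, 0.8]]
weak = [[0.2, 0.2], [0.0, 0.0]]

good = compositional_loss(cat, dog)
bad = compositional_loss(weak, weak)
```

In this toy setup the separated, strongly activated pair receives a lower combined loss than the overlapping, weakly activated pair, which is the qualitative behavior the abstract attributes to the two objectives. In an actual finetuning setup these terms would be computed on differentiable attention tensors inside the diffusion model rather than on plain Python lists.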
Zhipeng Bao
Yijun Li
Krishna Kumar Singh
University of Illinois Urbana-Champaign
Carnegie Mellon University
Adobe Systems (United States)
Bao et al. studied this question.
www.synapsesocial.com/papers/68e60868b6db64358759b63c — DOI: https://doi.org/10.1145/3641519.3657527