• A novel framework is proposed that enables stable training of multi-text image editing within a single model, without per-sample or per-prompt optimization.
• A region-based attention mechanism is adopted to ensure spatially localized editing.
• With these designs, real-time interaction becomes possible, and practical applications such as sequential editing can be achieved with high quality.

Leveraging the abundant knowledge learned by pre-trained multi-modal models such as CLIP has recently proved effective for text-guided image editing. Although convincing results have been achieved by combining the image generator StyleGAN with CLIP, most methods need to train a separate model for each prompt, and irrelevant regions are often changed after editing because of the lack of spatial disentanglement. We propose a novel framework that can edit different images according to different prompts within one model. In addition, an innovative region-based spatial attention mechanism is adopted to explicitly guarantee the locality of editing. Experiments, mainly in the face domain, verify the feasibility of our framework and show that, once multi-text editing and local editing are both available, our method supports practical applications such as sequential editing and regional style transfer.
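The text above does not spell out how the region-based attention enforces locality. As a rough illustration only, one common way to keep an edit spatially local is to blend edited and original features with a soft spatial mask, so that locations with zero attention are left untouched. The sketch below assumes that scheme; the function name `blend_with_region_mask`, the 2D grid representation, and the blending rule itself are hypothetical and are not taken from the paper.

```python
from typing import List

Grid = List[List[float]]


def blend_with_region_mask(original: Grid, edited: Grid, mask: Grid) -> Grid:
    """Per-location convex blend of two feature grids.

    Assumed scheme (not the paper's confirmed method):
    mask = 1 takes the edited value, mask = 0 keeps the original value,
    and values in between interpolate linearly.
    """
    if not (len(original) == len(edited) == len(mask)):
        raise ValueError("all grids must have the same number of rows")

    blended: Grid = []
    for row_o, row_e, row_m in zip(original, edited, mask):
        if not (len(row_o) == len(row_e) == len(row_m)):
            raise ValueError("all rows must have the same number of columns")
        out_row = []
        for o, e, m in zip(row_o, row_e, row_m):
            if not 0.0 <= m <= 1.0:
                raise ValueError("mask values must lie in [0, 1]")
            # Locations outside the attended region (m == 0) stay unchanged.
            out_row.append(m * e + (1.0 - m) * o)
        blended.append(out_row)
    return blended


if __name__ == "__main__":
    original = [[1.0, 1.0], [1.0, 1.0]]
    edited = [[5.0, 5.0], [5.0, 5.0]]
    mask = [[1.0, 0.0], [0.5, 0.0]]
    print(blend_with_region_mask(original, edited, mask))
```

Under this assumption, applying several edits in sequence with different masks changes only the regions each mask selects, which is the property that sequential editing relies on.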
Xiao et al. (Fri,) studied this question.