In this paper, we explore the generation of face images conditioned on a textual description, as well as the ability of such models to edit a machine-generated image on the basis of additional text prompts. We build our experimental system by coupling open source, state-of-the-art face image generators (StyleGAN models) with the open source multimodal embedding space CLIP in an optimisation loop, following the method of StyleCLIP. We evaluate the results with both automatic metrics and human ratings, which also gives insight into how well the automatic metrics correlate with human ratings. We find compelling evidence that the text-to-image and editing models based on StyleGAN2 are the better options. We also find that the automatic evaluation metrics are only weakly correlated with human ratings.
Fejjari et al. studied this question.