ABSTRACT Generating photo‐realistic images from natural language descriptions is a challenging task at the intersection of natural language processing and computer vision. Text‐to‐image synthesis aims to generate images whose content matches the semantic meaning of the input text. Recent diffusion‐based models achieve high image fidelity but suffer from slow inference and only coarse semantic alignment. To address these two problems and make generated images more realistic and more faithful to in‐the‐wild text, we propose a hybrid architecture called DM‐GAN+ATT+CL (dynamic memory GAN with attention mechanisms and contrastive learning). Our method proceeds in two steps: we first produce low‐resolution images with a DM‐GAN model equipped with dual attention modules, and then refine them through a memory‐based feature refinement mechanism. Contrastive learning is then applied on a separate dataset of high‐resolution image‐text pairs to enhance feature discrimination and strengthen semantic consistency. The result is stronger semantic relevance, greater image diversity, and better visual quality. Extensive experiments on four benchmark datasets (CUB, Oxford‐102, MS‐COCO, and MM‐CelebA‐HQ) show that the proposed DM‐GAN+ATT+CL framework consistently outperforms state‐of‐the‐art baselines. Notably, it achieves an R‐precision of 95.24, an inception score (IS) of 38.43, and a Fréchet inception distance (FID) of 11.30 on the MS‐COCO dataset, with similarly strong performance on the other datasets. These findings indicate that our approach substantially improves the diversity and realism of synthesized images and the semantic match between text and image.
Habib et al. studied this question.
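The abstract does not state which contrastive objective is used. As an illustration only, the sketch below assumes a symmetric InfoNCE‐style loss over a batch of paired image and text embeddings, a common choice for image‐text contrastive learning. The function names, the `temperature` value, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax_at(scores, k):
    """Log-probability of entry k under a softmax over scores (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[k] - log_z


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed objective, not the paper's).

    image_emb[k] and text_emb[k] are treated as a matched pair; every other
    combination in the batch serves as a negative.
    """
    imgs = [_normalize(v) for v in image_emb]
    txts = [_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(_log_softmax_at(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    columns = [[logits[r][k] for r in range(n)] for k in range(n)]
    t2i = -sum(_log_softmax_at(columns[k], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the captions swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = contrastive_loss(images, aligned_texts)
swapped_loss = contrastive_loss(images, swapped_texts)
```

Under this assumed objective, minimizing the loss pulls each image embedding toward its own caption and away from the other captions in the batch, which is one way to obtain the sharper feature discrimination and semantic consistency the abstract describes.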