
September 5, 2025 · Open Access

MemAttn‐CL: Unified Memory, Attention, and Contrastive Learning for Enhanced Text‐to‐Image Generation

Key points

The approach enhances semantic consistency and visual quality in synthetic images by utilizing contrastive learning.
The DM-GAN+ATT+CL framework achieved an R-precision of 95.24 on MS-COCO and improved image fidelity in text-to-image tasks.
Attention mechanisms were integrated into the two-step process, producing high-quality, photo-realistic images from text descriptions.
Extensive experiments show that the method consistently outperforms state-of-the-art baselines on the CUB, Oxford-102, MS-COCO, and MM-CelebA-HQ datasets.

Abstract

Generating photo-realistic images from natural language descriptions is a challenging task at the intersection of natural language processing and computer vision. Text-to-image synthesis requires generating images that match the semantic meaning of the input text. Recent diffusion-based models have demonstrated strong image fidelity but are slow at inference and exhibit coarse semantic alignment. To address these two problems and make generated images more realistic and more faithful to the semantics of in-the-wild text, we propose a novel hybrid architecture called DM-GAN+ATT+CL (dynamic memory GAN with attention mechanisms and contrastive learning). Our method proceeds in two steps: we first produce low-resolution images with the DM-GAN model equipped with dual attention modules, and then refine the results through a memory-based feature refinement mechanism. Contrastive learning is then applied on a separate dataset of high-resolution image-text pairs to enhance feature discrimination and strengthen semantic consistency. The result is greater semantic relevance, greater image diversity, and better visual quality. Extensive experiments on multiple benchmark datasets (CUB, Oxford-102, MS-COCO, and MM-CelebA-HQ) demonstrate that the proposed DM-GAN+ATT+CL framework consistently outperforms state-of-the-art baselines. Notably, it achieved an R-precision of 95.24, an inception score (IS) of 38.43, and a Fréchet inception distance (FID) of 11.30 on the MS-COCO dataset, with similarly strong and consistent performance on the other datasets. These findings indicate that our approach substantially improves the diversity and realism of synthetic images and is a promising direction for text-image matching.
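The abstract does not state the exact contrastive objective, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE-style loss over a batch of paired image and text features. The function names, the temperature of 0.1, and the use of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy_diag(sim):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)


def contrastive_loss(image_feats, text_feats, temperature=0.1):
    """Symmetric image-text contrastive loss (hypothetical helper).

    image_feats[i] and text_feats[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    sim = [[cosine(img, txt) / temperature for txt in text_feats]
           for img in image_feats]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


# Toy batch: two orthogonal image features with matched and swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the matched pairing yields a much smaller loss than the swapped pairing, which is the property such an objective relies on to pull paired image and text features together and push mismatched ones apart.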
