Recent advances in large language models (LLMs) have significantly improved text-to-image (T2I) generation, enabling systems to produce visually compelling and semantically meaningful images. However, preserving fine-grained semantic consistency in generated images, particularly for complex and region-specific textual prompts, remains a key challenge. In this work, we propose a context-aware hierarchical agent attention mechanism that integrates a semantic condensation strategy to improve attention efficiency while maintaining critical visual-textual alignment. By dynamically fusing contextual information, the method balances computational efficiency against semantic alignment with the textual description. Experimental results show improved visual coherence and semantic consistency across diverse prompts, as measured by quantitative metrics and supported by qualitative analysis. Our contributions are: (i) a semantic condensation strategy that improves attention efficiency while preserving critical feature information; (ii) a hierarchical agent attention mechanism that improves computational efficiency; and (iii) an iterative feedback method based on CLIP Score that improves image diversity and overall quality.
Fu et al. (Tue,) studied this question.
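
To make the efficiency argument concrete, the following is a minimal sketch of generic, single-level agent-style attention. It is not the hierarchical mechanism or the semantic condensation strategy proposed here, whose details the abstract does not give. The function name `agent_attention`, the mean-pooling of queries into agent tokens, and the parameter `num_agents` are all illustrative assumptions. The idea shown is that a small set of m agent tokens first aggregates information from the keys and values, and the n queries then attend only to those agents, so the cost scales roughly with n·m rather than n².

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def agent_attention(q, k, v, num_agents):
    """Single-head agent-style attention (illustrative sketch only).

    q: (n, d) queries, k: (n_kv, d) keys, v: (n_kv, d_v) values.
    Agent tokens are formed by mean-pooling contiguous groups of queries;
    this pooling choice is an assumption, not taken from the source.
    """
    n, d = q.shape
    if not 1 <= num_agents <= n:
        raise ValueError("num_agents must be between 1 and the number of queries")

    # Pool the queries into num_agents agent tokens: shape (m, d).
    groups = np.array_split(q, num_agents, axis=0)
    agents = np.stack([g.mean(axis=0) for g in groups])

    scale = 1.0 / np.sqrt(d)

    # Step 1: agents attend to keys and aggregate values: shape (m, d_v).
    agent_values = softmax(agents @ k.T * scale) @ v

    # Step 2: queries attend to agents and read the aggregated values: (n, d_v).
    return softmax(q @ agents.T * scale) @ agent_values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(16, 8))
    k = rng.normal(size=(20, 8))
    v = rng.normal(size=(20, 5))
    out = agent_attention(q, k, v, num_agents=4)
    print(out.shape)
```

Because both steps apply a row-wise softmax, each output row is a convex combination of the value rows, as in standard attention; only the routing through the agent tokens differs.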