Multimodal relation extraction (MRE) aims to identify relations between entities by jointly using text and images, and it plays a crucial role in knowledge graph construction. However, existing methods face two key challenges: (i) inconsistent image quality and semantic misalignment between text and images in readily available datasets, particularly those collected from social media, and (ii) the limitations of traditional unidirectional attention mechanisms in capturing fine-grained semantic associations, which often lead to information loss and noise. To address these issues, we propose LLM-VGA (Large Language Model-augmented Visual Generation with hierarchical Alignment), a novel framework that integrates an LLM-guided diffusion model with a hierarchical bidirectional cross-modal attention mechanism. Specifically, an adapter injects the semantic reasoning capabilities of an LLM into a diffusion model to generate high-quality, text-consistent images, thereby constructing pseudo-aligned multimodal data. Furthermore, a hierarchical bidirectional attention module captures interactions between the text and both the generated and source images at multiple granularities, alleviating information imbalance and improving alignment. Experimental results on the MNRE and MORE datasets show that LLM-VGA achieves F1 scores of 90.29% and 73.02%, respectively, significantly surpassing state-of-the-art baselines and confirming its effectiveness for multimodal relation extraction.
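To make the idea of hierarchical bidirectional cross-modal attention concrete, the following is a minimal sketch and not the LLM-VGA implementation. It assumes plain scaled dot-product attention without learned projections, mean pooling, and concatenation as the fusion step. The function names (`cross_attend`, `bidirectional_cross_attention`, `hierarchical_fusion`) and the representation of each granularity as a list of feature vectors are hypothetical choices made only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: every query attends over keys_values.

    Assumption: keys and values are the same vectors and there are no
    learned projection matrices, which a real model would include.
    """
    scale = math.sqrt(len(queries[0]))
    dim = len(keys_values[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def bidirectional_cross_attention(text, image):
    """Run attention in both directions instead of text-to-image only."""
    text_to_image = cross_attend(text, image)  # text tokens query image features
    image_to_text = cross_attend(image, text)  # image features query text tokens
    return text_to_image, image_to_text


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hierarchical_fusion(text, source_levels, generated_levels):
    """Fuse text with source and generated images at every granularity.

    Each *_levels argument is a list of granularities (for example a global
    image vector and a set of region vectors); this layout is an assumption.
    """
    fused = []
    for levels in (source_levels, generated_levels):
        for image in levels:
            t2i, i2t = bidirectional_cross_attention(text, image)
            fused.extend(mean_pool(t2i))
            fused.extend(mean_pool(i2t))
    return fused
```

In this sketch the fused vector concatenates one pooled summary per direction, per granularity, and per image type, so neither modality acts only as the query. A trained model would replace the pooling and concatenation with learned layers.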
Li et al. (Thu,) studied this question.