What question did this study set out to answer?

This research focuses on enhancing multimodal relation extraction by integrating a large language model with a novel visual generation framework.

May 27, 2026 | Open Access

LLM-VGA: Large Language Model-Augmented Visual Generation with Hierarchical Bidirectional Alignment for Multimodal Relation Extraction

Key Points

Proposed the LLM-VGA framework, which combines an LLM-guided diffusion model with hierarchical attention mechanisms.
Evaluated the framework on the MNRE and MORE datasets.
Applied a bidirectional cross-modal attention mechanism to improve semantic alignment between text and images.
Achieved F1 scores of 90.29% on the MNRE dataset and 73.02% on the MORE dataset.
Significantly surpassed existing state-of-the-art baselines in multimodal relation extraction on both datasets.
Reduced information loss and noise in cross-modal semantic associations through the hierarchical bidirectional attention module.

Abstract

Multimodal relation extraction (MRE) aims to jointly identify relationships between entities from text and images and plays a crucial role in knowledge graph construction. However, existing methods face two key challenges: (i) inconsistent image quality and semantic misalignment in readily available datasets, particularly from social media, and (ii) the limitations of traditional unidirectional attention mechanisms in capturing fine-grained semantic associations, often leading to information loss and noise. To address these issues, we propose LLM-VGA (Large Language Model-augmented Visual Generation with hierarchical Alignment), a novel framework that integrates an LLM-guided diffusion model with a hierarchical bidirectional cross-modal attention mechanism. Specifically, an adapter injects the semantic reasoning capabilities of an LLM into a diffusion model to generate high-quality, text-consistent images, thereby constructing pseudo-aligned multimodal data. Furthermore, a hierarchical bidirectional attention module captures interactions between text and both generated and source images at multiple granularities, alleviating information imbalance and improving alignment. Experimental results on the MNRE and MORE datasets show that LLM-VGA achieves F1 scores of 90.29% and 73.02%, respectively, significantly surpassing state-of-the-art baselines and confirming its effectiveness in enhancing multimodal relation extraction.
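
To make the idea of bidirectional cross-modal attention concrete, the following is a minimal sketch and not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, a single granularity level, and toy feature vectors; the function names (`cross_attend`, `bidirectional_cross_attention`) and the feature values are hypothetical. It only illustrates that text tokens attend over image regions while image regions attend over text tokens, so both modalities are enriched rather than one.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Scaled dot-product attention: each query vector attends over `context`.

    Assumption: learned projections are omitted, so the context vectors
    serve as both keys and values.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended.append([
            sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)
        ])
    return attended

def bidirectional_cross_attention(text_feats, image_feats):
    """Run attention in both directions and return (text_enriched, image_enriched)."""
    text_enriched = cross_attend(text_feats, image_feats)   # text -> image
    image_enriched = cross_attend(image_feats, text_feats)  # image -> text
    return text_enriched, image_enriched

# Hypothetical toy features: 3 text tokens and 2 image regions, 4 dimensions each.
text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
image_feats = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

text_enriched, image_enriched = bidirectional_cross_attention(text_feats, image_feats)
print(len(text_enriched), len(image_enriched))
```

A unidirectional variant would compute only one of the two calls inside `bidirectional_cross_attention`, leaving the other modality's features untouched. A hierarchical version, as the abstract describes, would presumably repeat this exchange at several granularities and for both generated and source images, which this sketch does not attempt.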
