What question did this study set out to answer?

The aim is to enhance Multimodal Named Entity Recognition by integrating external knowledge and improving cross-modal fusion.

March 25, 2026 · Open Access

Multimodal Named Entity Recognition with Prior Knowledge from Multimodal Large Models and Text-Directed Fusion

Key Points

The study aims to improve Multimodal Named Entity Recognition (MNER) by integrating external knowledge and making cross-modal fusion text-guided.
The proposed framework, PKTF, has two stages: prior assisted knowledge generation and multimodal named entity recognition.
InternVL2-8B generates prior assisted knowledge that adds context to the original text.
A Text-Max-Directed Fusion Module (TMDF) uses gates to modulate text-guided max-pooling attention scores.
PKTF achieves F1-scores of 75.43% on the Twitter-2015 dataset and 88.74% on the Twitter-2017 dataset.
The method is competitive with existing multimodal models.

Abstract

Multimodal Named Entity Recognition (MNER) aims to improve entity prediction by fusing textual and visual information. Most current MNER methods face two main issues: 1) they neither adequately incorporate external knowledge into the model nor account for knowledge redundancy; 2) during cross-modal fusion, they fail to achieve text-guided multimodal integration and therefore introduce excessive image noise. To address these issues, we propose a novel framework, PKTF, which consists of two stages: the prior assisted knowledge generation stage and the multimodal named entity recognition stage. In the first stage, we use InternVL2-8B to generate prior assisted knowledge that provides additional contextual information for the original text. In the second stage, we design a Text-Max-Directed Fusion Module (TMDF). Specifically, we use gates to modulate the text-guided max-pooling attention scores, which yields text-guided saliency attention scores. These scores let the model extract as much MNER-relevant information as possible from the image features while keeping the text dominant. Experimental results show that our method is competitive with existing models, achieving F1-scores of 75.43% on the Twitter-2015 dataset and 88.74% on the Twitter-2017 dataset.
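
The abstract describes the fusion step only in words, so the following is a minimal sketch of one way gated, text-guided max-pooling attention could look, not the paper's actual TMDF. Everything specific here is an assumption made for illustration: the function name `text_max_directed_fusion`, the scaled dot-product scores, the gate computed as a sigmoid of the mean text response, and the residual addition that keeps the text features dominant.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    peak = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def text_max_directed_fusion(text, image):
    """Hypothetical sketch. text: n token vectors, image: m region vectors (dim d)."""
    d = len(text[0])
    scale = math.sqrt(d)
    # scores[i][j]: scaled similarity between text token i and image region j.
    scores = [[dot(t, v) / scale for v in image] for t in text]
    regions = range(len(image))
    # Max pooling over text tokens: the strongest text response for each region.
    pooled = [max(row[j] for row in scores) for j in regions]
    # Assumed gate: sigmoid of the mean text response for each region.
    gates = [sigmoid(sum(row[j] for row in scores) / len(text)) for j in regions]
    # Gated scores, normalised into saliency weights over image regions.
    weights = softmax([g * p for g, p in zip(gates, pooled)])
    # Weighted visual summary, added residually so the text representation dominates.
    visual = [sum(w * v[k] for w, v in zip(weights, image)) for k in range(d)]
    fused = [[t[k] + visual[k] for k in range(d)] for t in text]
    return fused, weights


# Toy input: region 0 aligns with the first token, region 1 opposes both tokens.
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [-1.0, -1.0]]
fused, weights = text_max_directed_fusion(text, image)
```

In this toy input the region that agrees with the text receives the larger saliency weight, which is the behaviour the abstract attributes to text-guided fusion: image regions unrelated to the text contribute less to the fused representation.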
