Multimodal Named Entity Recognition (MNER) aims to improve entity prediction by fusing textual and visual information. Most current MNER methods have two main limitations: 1) they overlook both the need to incorporate external knowledge into the model and the redundancy that such knowledge can contain; 2) their cross-modal fusion is not guided by the text, so excessive image noise is introduced. To address these issues, we propose PKTF, a novel framework that consists of two stages: a prior-assisted knowledge generation stage and a multimodal named entity recognition stage. In the first stage, we use InternVL2-8B to generate prior-assisted knowledge that provides additional context for the original text. In the second stage, we design a Text-Max-Directed Fusion (TMDF) module. Specifically, guided by the text, we use gates to modulate the max-pooling attention scores and obtain text-guided saliency attention scores. These scores allow the model to extract as much MNER-relevant information as possible from the image features while keeping the text dominant. Experimental results show that our method is competitive with existing models, achieving F1-scores of 75.43% on the Twitter-2015 dataset and 88.74% on the Twitter-2017 dataset.
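The idea of gating max-pooled attention scores under text guidance can be illustrated with a small sketch. The abstract does not specify the TMDF equations, so everything below is an assumption made for illustration: the function name `text_guided_fusion`, the scaled dot-product scores, the scalar sigmoid gate computed from the mean text vector, and the residual form of the fusion are all hypothetical choices and should not be read as the actual PKTF implementation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def text_guided_fusion(text, image, gate_w, gate_b=0.0):
    """Hypothetical sketch of text-guided, gated max-pooling attention.

    text:   list of token vectors (n x d)
    image:  list of region vectors (m x d)
    gate_w: gate weights (length d), an assumed learnable parameter
    """
    d = len(text[0])
    scale = math.sqrt(d)

    # scores[i][j]: scaled similarity between text token i and image region j.
    scores = [[dot(t, v) / scale for v in image] for t in text]

    # Max pooling over text tokens: keep, for each image region,
    # the strongest response from any token.
    pooled = [max(row[j] for row in scores) for j in range(len(image))]

    # Assumed gate: a scalar in (0, 1) computed from the text alone,
    # so the text decides how much visual signal is admitted.
    mean_text = [sum(t[k] for t in text) / len(text) for k in range(d)]
    gate = sigmoid(dot(gate_w, mean_text) + gate_b)

    # Gated scores normalized into saliency weights over image regions.
    saliency = softmax([gate * s for s in pooled])

    # Saliency-weighted summary of the image features.
    visual = [sum(a * v[k] for a, v in zip(saliency, image)) for k in range(d)]

    # Residual fusion: each text token is kept intact and only
    # receives a gated visual offset, so the text stays dominant.
    fused = [[t[k] + gate * visual[k] for k in range(d)] for t in text]
    return fused, saliency, gate
```

In this sketch, a gate close to zero leaves the text representation essentially unchanged, which is one simple way to express "text dominance"; the actual module may realize it differently.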
HE et al. (Thu,) studied this question.