Multimodal Named Entity Recognition (MNER) aims to leverage visual information to identify entity boundaries and categories in social media posts. Existing methods mainly adopt heterogeneous architectures, using ResNet (CNN-based) and BERT (Transformer-based) to model visual and textual features, respectively. However, current approaches still face two issues: (1) weak cross-modal correlations and poor semantic consistency, and (2) suboptimal fusion results when visual objects and textual entities are inconsistent. To this end, we propose a Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction (VEC-MNER) model for MNER. Specifically, in contrast to heterogeneous architectures, we propose a new homogeneous Hybrid Transformer architecture, which naturally reduces the heterogeneity between modalities. Moreover, we design the Correlation-Aware Alignment (CAA-Encoder) layer and the Correlation-Aware Deep Fusion (CADF-Encoder) layer, combined with contrastive learning, to achieve more effective implicit alignment and deep semantic fusion between modalities, respectively. We also construct a Correlation-Aware (CA) module that effectively reduces heterogeneity between modalities and alleviates visual deviation. Experimental results demonstrate that our approach achieves state-of-the-art (SOTA) performance, with F1 scores of 74.89% and 87.51% on Twitter-2015 and Twitter-2017, respectively.
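The abstract mentions contrastive learning for cross-modal alignment but does not state the loss. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style objective over paired text and image embeddings. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def diagonal_nll(logits):
    """Mean negative log-probability of the diagonal entries under a row-wise softmax."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Hypothetical text-image contrastive loss; pair i of each list is the positive pair."""
    text = [l2_normalize(v) for v in text_embs]
    image = [l2_normalize(v) for v in image_embs]
    # Cosine similarity of every text against every image, scaled by temperature.
    logits = [[dot(t, v) / temperature for v in image] for t in text]
    # Transpose to score every image against every text.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_nll(logits) + diagonal_nll(logits_t))


# Toy example: two text embeddings and two image embeddings.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(text_embs, aligned_images)
swapped_loss = symmetric_contrastive_loss(text_embs, swapped_images)
print(aligned_loss < swapped_loss)
```

In this toy setup the loss is lower when each text embedding points in the same direction as its paired image embedding, which is the behavior an alignment objective of this kind is meant to encourage.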
Wei et al. (Thu,) studied this question.