Multimodal named-entity recognition (MNER) aims to identify entities by leveraging multimodal features. As recent research shifts to multi-image scenarios, existing methods overlook modality noise and lack effective cross-modal interaction, which leaves prominent semantic gaps between modalities. This study integrates symmetric multimodal fusion with contrastive learning and proposes a model with a symmetric-encoder collaborative architecture. To mitigate the noise, a modality refinement encoder maps each modality to its own exclusive space, while an aligned encoder bridges the semantic gaps via contrastive learning in a shared space, going beyond the superficial cross-modal mapping of existing models. Building on these encoders, the symmetric fusion module performs deep bidirectional fusion, overcoming the limitations of traditional one-way or concatenation-based fusion. Experiments on two datasets show that the model outperforms state-of-the-art methods, and ablation experiments validate the role of the symmetric encoder in consistent multimodal learning.
Wu et al. (Sat,) studied this question.
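To make the shared-space alignment idea concrete, the following is a minimal sketch of a symmetric contrastive objective between text and image embeddings. The abstract does not specify the loss, so an InfoNCE-style formulation is assumed here; the function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_logits(text_emb, image_emb, temperature):
    """Temperature-scaled cosine similarity between every text/image pair."""
    texts = [l2_normalize(v) for v in text_emb]
    images = [l2_normalize(v) for v in image_emb]
    return [
        [sum(a * b for a, b in zip(t, i)) / temperature for i in images]
        for t in texts
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row k's correct class is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Average of the text-to-image and image-to-text losses (assumed InfoNCE form)."""
    logits = cosine_logits(text_emb, image_emb, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy embeddings (hypothetical): pair k of text and image should match.
text = [[1.0, 0.0], [0.0, 1.0]]
image_aligned = [[0.9, 0.1], [0.1, 0.9]]
image_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(text, image_aligned)
loss_shuffled = symmetric_contrastive_loss(text, image_shuffled)
```

In this sketch the loss is lower when matching text and image embeddings lie close together in the shared space, which is the behaviour an alignment encoder trained with contrastive learning would be expected to encourage. Averaging both directions keeps the objective symmetric in the two modalities.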