June 20, 2026 | Open Access

MIEM‐CA: A Multigranularity Information‐Enhanced Entity Matching Method Based on Collaborative Agents

Key Points

The study aims to improve entity matching by integrating external knowledge, representing semantics at multiple granularities, and handling numerical values.
The proposed MIEM-CA method combines multiagent information enhancement, multigranularity semantic encoding, and a numerical-aware agent module.
It draws on external knowledge and autonomous agents (attribute selection, web search, and feature extraction) to build more complete entity representations.
It was evaluated on 10 benchmark datasets covering structured, dirty, and textual EM settings.
Compared with five baseline methods, it improved the average F1 score by 6.35% on structured datasets.
On dirty datasets, the average F1 improvement was 9.07%.
Across all datasets, the average F1 improvement was 8.11%.

Abstract

Entity matching (EM) aims to identify records from different data sources referring to the same real‐world entity. Despite remarkable advances with pretrained language models (PLMs), existing PLM‐based matchers still encounter significant challenges in effectively integrating external knowledge, representing semantic information at multiple granularities, and handling numerical snippets. To address these challenges, we propose a multigranularity information‐enhanced EM method based on collaborative agents (MIEM‐CA), featuring three key components: (1) a multiagent information enhancement module (MI) that leverages extensive external knowledge, the decision‐making and collaboration capabilities of autonomous agents, and the semantic comprehension power of large language models (LLMs), by integrating attribute selection, web search, and feature extraction agents to improve the completeness of entity representation; (2) a multigranularity semantic encoder (ME) that incrementally captures and integrates token‐, attribute‐, and entity‐level semantics, along with their cross‐level correlations, across hierarchical representations spanning the token, attribute, and entity layers (ELs); and (3) a numerical‐aware agent module (NA) that employs the chain‐of‐thought (CoT) strategy to extract numerical information effectively, leverages LLMs to infer the semantic types of these numerical values, and calculates their semantic‐aware numerical similarity. Comprehensive experiments on 10 benchmark datasets, which cover structured, dirty, and textual EM settings, demonstrate that, compared with five baseline methods, MIEM‐CA achieves an average F1 score improvement of 6.35% on structured datasets, 9.07% on dirty datasets, and 8.11% across all datasets.
