What question did this study set out to answer?

The main goal is to develop a framework for generalized entity matching that adapts to various data structures and domains without the need for labeled data.

May 20, 2026

Generalized Entity Matching with Adaptivity via Large Language Models

Key Points

Developed GLEAM, an end-to-end framework for generalized entity matching.
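Implemented a structure- and content-aware blocking module that builds a weighted heterogeneous graph encoding schema context and semantic similarity.
Used an adaptive connector that models candidate scores with a two-component Gaussian mixture model (GMM) and applies Bayesian updates from online matching feedback, enabling entity-specific early stopping.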
GLEAM achieved up to a 25.7% improvement in F1 score over leading methods.
Consistently outperformed unsupervised and LLM-based baselines while remaining competitive with supervised methods.
Demonstrated strong end-to-end F1 performance across diverse datasets.

Abstract

Entity matching is a fundamental task in data management, supporting applications such as product matching in e-commerce, linking scholarly records, unifying location data, and cross-source fact-checking. Its generalized form, known as generalized entity matching (GEM), extends the challenge to heterogeneous sources across structured, semi-structured, and unstructured data, where schema mismatch, structural variation, and domain drift often make supervision costly or infeasible. Existing approaches either rely heavily on labeled data or fail to generalize across domains, lacking the ability to use both structural and content information at scale. We present GLEAM, an end-to-end generalized entity matching framework that dynamically adapts to data structure and domain characteristics without requiring labeled pairs. GLEAM integrates (i) a structure- and content-aware blocking module that constructs a weighted heterogeneous graph encoding schema context and semantic similarity, guided by a lightweight LLM-based attribute importance estimator, (ii) an adaptive connector that models candidate score distributions via a two-component GMM and performs Bayesian updates using online feedback from matching, enabling entity-specific early stopping, and (iii) a hierarchical reasoning matching module that infers domain and attribute hierarchies from a few examples and performs comparative selection over candidate sets with domain-aware prompts. Across a diverse set of heterogeneous datasets, GLEAM consistently achieves strong end-to-end F1 while reducing unnecessary LLM calls. It outperforms leading unsupervised and LLM-based baselines and remains competitive with supervised methods, while achieving up to a 25.7% improvement in F1. We further provide probabilistic analyses of the connector's thresholding behavior and update rules, along with ablation studies on graph weighting, adaptivity, and hierarchical prompting.
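
To make the adaptive connector idea more concrete, here is a minimal sketch of how a two-component GMM over candidate scores could drive entity-specific early stopping with a simple Bayesian update. It is an illustration under stated assumptions, not GLEAM's actual implementation: the function names (`fit_two_component_gmm`, `match_posterior`, `select_candidates`), the Beta prior on the match rate, the `stop_below` threshold, and the `matcher` callback are all hypothetical, and the abstract does not specify the paper's update rules.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(scores, iters=50, min_sigma=1e-3):
    """Fit a 1-D two-component Gaussian mixture with plain EM.

    Returns (means, std_devs, weights), each a list of length 2.
    """
    lo, hi = min(scores), max(scores)
    mu = [lo, hi]                          # start one component at each extreme
    spread = max((hi - lo) / 4.0, min_sigma)
    sd = [spread, spread]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in scores:
            w = [pi[k] * normal_pdf(x, mu[k], sd[k]) for k in range(2)]
            total = sum(w) or 1e-300       # guard against underflow to zero
            resp.append([wk / total for wk in w])
        # M-step: re-estimate mean, spread, and weight of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue                   # component is empty; keep old values
            mu[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, scores)) / nk
            sd[k] = max(math.sqrt(var), min_sigma)
            pi[k] = nk / len(scores)
    return mu, sd, pi


def match_posterior(x, mu, sd, prior_match):
    """P(match | score x), treating the higher-mean component as 'match'."""
    hi = 0 if mu[0] > mu[1] else 1
    lo = 1 - hi
    p_hi = prior_match * normal_pdf(x, mu[hi], sd[hi])
    p_lo = (1.0 - prior_match) * normal_pdf(x, mu[lo], sd[lo])
    total = p_hi + p_lo
    return p_hi / total if total > 0 else 0.0


def select_candidates(scores, matcher, stop_below=0.1, a=1.0, b=1.0):
    """Send candidates to the matcher in descending score order.

    Assumption: a Beta(a, b) prior on the match rate, updated after each
    matcher verdict; stop once the match posterior drops below `stop_below`.
    `matcher(i)` is a hypothetical callback (e.g. an LLM call) returning bool.
    """
    mu, sd, _ = fit_two_component_gmm(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    sent = []
    for i in order:
        prior = a / (a + b)                # posterior mean of the Beta prior
        if match_posterior(scores[i], mu, sd, prior) < stop_below:
            break                          # early stop for this entity
        sent.append(i)
        if matcher(i):
            a += 1.0                       # feedback: confirmed match
        else:
            b += 1.0                       # feedback: rejected candidate
    return sent


# Example: two high-scoring candidates followed by a low-scoring tail.
scores = [0.92, 0.88, 0.15, 0.12, 0.10, 0.08, 0.11, 0.09]
sent = select_candidates(scores, matcher=lambda i: i == 0)
print(sent)
```

In this sketch only the candidates that plausibly belong to the high-score component are forwarded to the matcher, which is one way a connector of this kind could cut unnecessary LLM calls. Note that with unequal component variances the posterior is not guaranteed to be monotone in the score, so a production design would need a more careful stopping rule.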
