Entity matching is a fundamental task in data management, supporting applications such as product matching in e-commerce, linking scholarly records, unifying location data, and cross-source fact-checking. Its generalization, generalized entity matching (GEM), extends the task to heterogeneous sources spanning structured, semi-structured, and unstructured data, where schema mismatch, structural variation, and domain drift often make supervision costly or infeasible. Existing approaches either rely heavily on labeled data or fail to generalize across domains, and they cannot exploit both structural and content information at scale. We present GLEAM, an end-to-end generalized entity matching framework that dynamically adapts to data structure and domain characteristics without requiring labeled pairs. GLEAM integrates (i) a structure- and content-aware blocking module that constructs a weighted heterogeneous graph encoding schema context and semantic similarity, guided by a lightweight LLM-based attribute importance estimator; (ii) an adaptive connector that models candidate score distributions with a two-component Gaussian mixture model (GMM) and performs Bayesian updates using online feedback from matching, enabling entity-specific early stopping; and (iii) a hierarchical reasoning matching module that infers domain and attribute hierarchies from a few examples and performs comparative selection over candidate sets with domain-aware prompts. Across a diverse set of heterogeneous datasets, GLEAM consistently achieves strong end-to-end F1 while reducing unnecessary LLM calls. It outperforms leading unsupervised and LLM-based baselines and remains competitive with supervised methods, while achieving up to a 25.7% improvement in F1. We further provide probabilistic analyses of the connector's thresholding behavior and update rules, along with ablation studies on graph weighting, adaptivity, and hierarchical prompting.
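To make the adaptive connector's thresholding idea concrete, the following is a minimal sketch, not GLEAM's actual implementation. It assumes that each candidate for a query entity carries a scalar blocking score, fits a two-component one-dimensional Gaussian mixture to those scores with plain EM, and stops passing candidates to the matcher once the posterior probability of the higher-mean ("match") component drops below a cutoff. The function names, the EM initialization, and the 0.5 cutoff are all hypothetical choices made for illustration; the Bayesian update from online matching feedback is omitted.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(scores, iters=50, min_std=1e-3):
    """Fit a 1-D two-component GMM with EM.

    Returns (weights, means, stds), ordered so that index 1 is the
    higher-mean component, interpreted here as the "match" component.
    """
    lo, hi = min(scores), max(scores)
    means = [lo, hi]  # hypothetical initialization: start at the extremes
    spread = max((hi - lo) / 4.0, min_std)
    stds = [spread, spread]
    weights = [0.5, 0.5]

    for _ in range(iters):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in scores:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total] if total > 0 else [0.5, 0.5])

        # M-step: re-estimate weight, mean, and std of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component received no mass; keep old parameters
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, scores)) / nk
            stds[k] = max(math.sqrt(var), min_std)  # floor avoids collapse

    if means[0] > means[1]:  # enforce the ordering promised in the docstring
        weights.reverse()
        means.reverse()
        stds.reverse()
    return weights, means, stds


def match_posterior(score, weights, means, stds):
    """Posterior probability that a score came from the match component."""
    p_non = weights[0] * normal_pdf(score, means[0], stds[0])
    p_match = weights[1] * normal_pdf(score, means[1], stds[1])
    total = p_non + p_match
    return p_match / total if total > 0 else 0.0


def candidates_to_verify(scored_candidates, cutoff=0.5):
    """Return candidates worth sending to the matcher, best first.

    scored_candidates is a list of (candidate_id, score) pairs. Iteration
    stops at the first candidate whose match posterior falls below cutoff.
    """
    if len(scored_candidates) < 2:
        return [cand for cand, _ in scored_candidates]
    params = fit_two_component_gmm([s for _, s in scored_candidates])
    kept = []
    for cand, score in sorted(scored_candidates, key=lambda c: -c[1]):
        if match_posterior(score, *params) < cutoff:
            break  # entity-specific early stop
        kept.append(cand)
    return kept
```

For a query entity whose candidates score, say, 0.95, 0.92, and 0.90 alongside a cluster between 0.18 and 0.30, `candidates_to_verify` keeps only the three high-scoring candidates, so the matcher is never called on the rest. Because the cutoff is derived from each entity's own score distribution rather than a global threshold, the number of candidates verified varies per entity.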
Chen et al. (Mon,) studied this question.