This paper addresses the challenge of deduplicating job postings in large, heterogeneous datasets by introducing an efficient, multi-stage methodology that combines embedding-based filtering with Large Language Model (LLM) validation. The system first preprocesses the data and vectorizes key textual fields with a text embedding model. To avoid the computational cost of exhaustive pairwise comparison, a clustering-based grouping mechanism restricts comparisons to semantically coherent clusters. Candidate duplicates are then validated by LLMs, which assess semantic equivalence across highlighted differences in job titles, descriptions, companies, and locations. Evaluated on an augmented dataset of 50,000 job postings, the system produced 6,669 candidate pairs for validation. Among the evaluated models, GPT-4o achieved the highest F1-score (95.10%), while the lightweight Phi-4 model remained competitive (92.58%) at significantly lower computational cost. These findings indicate that the proposed hybrid framework achieves high semantic precision while remaining scalable for continuous large-scale deployment.
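The cluster-restricted candidate generation step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the embedding model, the clustering algorithm, or any similarity threshold, so the toy 2-D vectors, the precomputed `cluster_labels`, the `candidate_pairs` helper, and the `threshold=0.9` value are all assumptions chosen for illustration. The sketch only shows why grouping reduces work: pairs are enumerated within each cluster rather than across the whole dataset.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_pairs(embeddings, cluster_labels, threshold=0.9):
    """Return index pairs that share a cluster and meet the similarity threshold.

    `threshold` is an illustrative value, not one reported in the paper.
    """
    # Group posting indices by their (precomputed) cluster label.
    clusters = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        clusters[label].append(idx)

    pairs = []
    for members in clusters.values():
        # Compare only within a cluster instead of across all postings.
        for i, j in combinations(members, 2):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy data: hypothetical 2-D "embeddings" and hypothetical cluster labels.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.1, 1.0], [1.0, 1.0]]
cluster_labels = [0, 0, 1, 1, 2]

pairs = candidate_pairs(embeddings, cluster_labels)
print(pairs)
```

With five postings, exhaustive comparison would score 10 pairs, whereas the clustered version scores only 2. In the pipeline described above, the surviving pairs would then be passed to an LLM for validation, a step this sketch does not attempt to model.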
Thivaios et al. (Sun,) studied this question.