March 3, 2026 | Open Access

On the Task of Job Posting Deduplication Using Embedding-Based Filtering and LLM Validation

Key Points

The central aim is to improve job posting deduplication in large datasets through a multi-stage methodology.
The pipeline first preprocesses the data and vectorizes key textual fields with a text embedding model.
Clustering-based grouping then restricts pairwise comparisons to semantically coherent clusters.
Large language models validate candidate duplicates by assessing semantic equivalence across differences in job titles, descriptions, companies, and locations.
The system was evaluated on an augmented dataset of 50,000 job postings, which produced 6,669 candidate pairs for validation.
GPT-4o achieved the highest F1-score of 95.10%.
The lightweight Phi-4 model was competitive at 92.58%, with significantly lower computational cost.
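
For readers unfamiliar with the metric, the F1-score cited above is the harmonic mean of precision and recall over the pairs a model labels as duplicates. The following is a minimal sketch of the standard definition; the function name and the counts in the usage line are illustrative assumptions, not figures from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    tp: pairs correctly labeled as duplicates
    fp: pairs wrongly labeled as duplicates
    fn: true duplicate pairs that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not the paper's results).
print(round(f1_score(tp=90, fp=10, fn=10), 4))
```

Because F1 penalizes both false merges and missed duplicates, it suits deduplication, where either error type degrades the dataset.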

Abstract

This paper addresses the challenge of deduplicating job postings in large, heterogeneous datasets by introducing an efficient, multi-stage methodology that combines embedding-based filtering with Large Language Model (LLM) validation. The proposed system begins with data preprocessing and semantic vectorization of key textual fields using a text embedding model. To reduce the computational cost of exhaustive pairwise comparisons, a clustering-based grouping mechanism is employed to restrict comparisons to semantically coherent clusters. Candidate duplicates are then validated using LLMs, which assess semantic equivalence across highlighted differences in job titles, descriptions, companies, and locations. The proposed system is evaluated on an augmented dataset of 50,000 job postings, producing 6,669 candidate pairs for validation. Among the evaluated models, GPT-4o achieved the highest F1-score (95.10%), while the lightweight Phi-4 model demonstrated competitive performance (92.58%) with significantly lower computational cost. These findings demonstrate that the proposed hybrid framework achieves high semantic precision while remaining scalable for continuous large-scale deployment.
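
The staged design described in the abstract can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: a bag-of-words count vector stands in for the text embedding model, greedy leader clustering stands in for the paper's clustering mechanism, the thresholds are arbitrary, and the LLM validator is represented by a caller-supplied function `is_duplicate`.

```python
import math
import re
from collections import Counter
from itertools import combinations


def exhaustive_pairs(n: int) -> int:
    """Number of unordered pairs among n postings: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold):
    """Greedy leader clustering (an assumed stand-in): each vector joins the
    first cluster whose leader is similar enough, else starts a new cluster."""
    clusters = []  # lists of indices; the first index in each list is the leader
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def candidate_pairs(postings, cluster_threshold=0.3, pair_threshold=0.6):
    """Compare postings only within a cluster; keep sufficiently similar pairs."""
    vectors = [embed(p) for p in postings]
    pairs = []
    for members in cluster(vectors, cluster_threshold):
        for i, j in combinations(members, 2):
            if cosine(vectors[i], vectors[j]) >= pair_threshold:
                pairs.append((i, j))
    return pairs


def validate(pairs, postings, is_duplicate):
    """Final stage: is_duplicate is a hypothetical hook where an LLM call would go."""
    return [(i, j) for i, j in pairs if is_duplicate(postings[i], postings[j])]


# Exhaustive comparison of 50,000 postings would need this many pair checks,
# versus the 6,669 candidate pairs the study reports sending to validation.
print(exhaustive_pairs(50_000))  # 1249975000

# Made-up example postings, for illustration only.
example = [
    "Senior Python Developer at Acme Corp, Berlin",
    "Senior Python Developer - Acme Corp (Berlin)",
    "Registered Nurse, City Hospital, Leeds",
]
print(candidate_pairs(example))  # [(0, 1)]
```

The point of the filtering stages is visible in the arithmetic: the expensive validator runs only on pairs that survive clustering and the similarity threshold, not on all n(n-1)/2 combinations.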
