The United Nations’ Sustainable Development Goals (UN SDGs) prioritise inclusive and fair employment. However, AI-powered recruitment tools, particularly Large Language Models (LLMs), raise concerns about potential demographic bias. This paper presents a controlled synthetic dataset and methodology to measure how sensitive attributes (e.g., race, gender, age) influence candidate rankings and pairwise comparisons in LLM-based hiring pipelines. Specifically, we generated a balanced dataset of 1,000 synthetic candidate profiles (each including a cover letter) and evaluated it using 28 frontier LLMs, spanning proprietary (e.g., OpenAI GPT, Gemini, Grok, Claude) and open-source (e.g., Llama, GigaChat) models. Synthetic data eliminates real-world demographic and occupational confounders, ensuring that observed disparities reflect only the LLMs’ intrinsic behaviour. Results show that professional attributes (e.g., skills, experience) are the primary ranking drivers, with 76%–80% of these effects statistically significant; however, 8%–9% of demographic attributes exhibit persistent, significant biases across multiple LLMs. We develop a “bias map” quantifying LLM performance, emphasising that mitigating even minor biases in automated hiring is critical to avoid perpetuating employment inequities and to uphold the UN SDGs’ inclusive vision.
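To make the measurement setup more concrete, the sketch below shows one way a counterfactual pairwise-comparison probe of a single sensitive attribute could be implemented. It is a minimal illustration under stated assumptions, not the paper's actual pipeline: the profile fields, the prompt wording, the query_llm stub, and the choice of a binomial test against a 50/50 preference rate are all introduced here for illustration.

```python
# Minimal sketch: counterfactual pairwise probe for one sensitive attribute.
# Profile fields, prompt text, and query_llm are hypothetical placeholders,
# not the paper's protocol.
import random
from dataclasses import dataclass, replace
from scipy.stats import binomtest


@dataclass
class Profile:
    name: str
    gender: str            # sensitive attribute under test
    years_experience: int
    skills: str


def render(p: Profile) -> str:
    return (f"Candidate {p.name}: {p.gender}, "
            f"{p.years_experience} years of experience, skills: {p.skills}")


def query_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call; replace with a real client.
    Expected to return 'A' or 'B'."""
    return random.choice(["A", "B"])


def counterfactual_preference_test(profiles, attr="gender", alt_value="female"):
    """Compare each profile against an identical copy differing only in the
    sensitive attribute, and count how often the LLM prefers the original.
    Under no bias, the preference rate should be close to 0.5."""
    wins_original, total = 0, 0
    for p in profiles:
        counterfactual = replace(p, **{attr: alt_value})
        prompt = ("Which candidate is stronger for a software engineer role? "
                  "Answer with 'A' or 'B' only.\n"
                  f"A) {render(p)}\nB) {render(counterfactual)}")
        if query_llm(prompt).strip().upper().startswith("A"):
            wins_original += 1
        total += 1
    # Two-sided binomial test against the unbiased 0.5 preference rate.
    return binomtest(wins_original, total, p=0.5)


if __name__ == "__main__":
    candidates = [Profile(f"cand_{i}", "male", 3 + i % 7, "Python, SQL")
                  for i in range(100)]
    result = counterfactual_preference_test(candidates)
    print(f"preference for original: {result.statistic:.2f}, "
          f"p-value: {result.pvalue:.3f}")
```

In this style of probe, a small p-value would indicate that flipping the sensitive attribute alone systematically shifts the LLM's pairwise decisions, which is the kind of disparity the bias map aggregates across attributes and models.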
Eldar Jalilzade
Maksim Kalameyets
Shrikant Malviya
Newcastle University
Durham University
DOI: https://doi.org/10.1109/bigdata66926.2025.11401029