We report a proof-of-concept evaluation of Large Language Models (LLMs) as additional annotators within a production data annotation workflow. We analyze 12,745 social media posts across two taxonomies, Funnel stage and Intent, and evaluate two models (GPT-5 and GPT-5 Mini) under two prompting strategies (zero-shot and few-shot). We benchmark human–LLM (HL) and gold–LLM (GL) disagreement against empirical human–human (HH) disagreement baselines (50.5% overall). Our quantitative analysis goes beyond aggregate rates: confusion matrices show which labels are confused with which, revealing that humans and LLMs exhibit different systematic biases. Qualitative assessment further indicates that, while LLMs effectively surface human omissions, LLM disagreements are more often “unhelpful” than those between humans. Ultimately, this work provides a methodological foundation for integrating LLMs as collaborative agents that enhance data reliability, rather than simply automating annotation at scale.
Almady-Palotai et al. studied this question.
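To make the benchmarking idea concrete, the following is a minimal sketch of how a pairwise disagreement rate and a confusion matrix could be computed for two annotators. It is not the study's pipeline: the function names `disagreement_rate` and `confusion_counts`, the label names, and the toy data are all hypothetical, and the sketch assumes each post carries exactly one label per annotator.

```python
from collections import Counter


def disagreement_rate(labels_a, labels_b):
    """Share of items on which two annotators assign different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have equal length")
    if not labels_a:
        raise ValueError("no items to compare")
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return differing / len(labels_a)


def confusion_counts(labels_a, labels_b):
    """Count (label_a, label_b) pairs; off-diagonal pairs are disagreements."""
    return Counter(zip(labels_a, labels_b))


# Hypothetical toy labels (assumed for illustration, not taken from the study).
human_1 = ["awareness", "consideration", "purchase", "awareness"]
human_2 = ["awareness", "purchase", "purchase", "consideration"]
llm = ["awareness", "consideration", "purchase", "consideration"]

hh_rate = disagreement_rate(human_1, human_2)  # human-human baseline
hl_rate = disagreement_rate(human_1, llm)      # human-LLM disagreement
hl_confusion = confusion_counts(human_1, llm)

print(f"HH disagreement: {hh_rate:.2f}")
print(f"HL disagreement: {hl_rate:.2f}")
for (human_label, llm_label), count in sorted(hl_confusion.items()):
    print(f"human={human_label:<14} llm={llm_label:<14} n={count}")
```

Comparing `hl_rate` against `hh_rate` mirrors the comparison of HL disagreement with the HH baseline, while the off-diagonal cells of `hl_confusion` indicate which label pairs drive the disagreement, which an aggregate rate alone would hide.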