June 19, 2026 · Open Access

AI-in-the-Loop: Scaling Brand Visibility through LLM-Augmented Annotation

Key points

This study aims to evaluate the effectiveness of LLMs as annotators in a data annotation workflow on social media posts.
Analyzed 12,745 social media posts categorized by Funnel stage and Intent.
Evaluated two models (GPT-5 and GPT-5 Mini) using zero-shot and few-shot prompting strategies.
Compared human-LLM and gold-LLM disagreement against human-human disagreement baselines.
The overall human–human disagreement rate was 50.5%.
LLMs surfaced human omissions, but their disagreements were more often 'unhelpful' than those between humans.
Quantitative analysis revealed distinct systematic biases in both human and AI annotators.

Abstract

Our study reports a proof-of-concept evaluation of Large Language Models (LLMs) as additional annotators within a production data annotation workflow. We analyze 12,745 social media posts across two taxonomies: Funnel stage and Intent. We evaluate two models (GPT-5, GPT-5 Mini) under two prompting strategies, zero-shot and few-shot. We benchmark human–LLM (HL) and gold–LLM (GL) disagreement against empirical human–human (HH) disagreement baselines (50.5% overall). Our quantitative analysis goes beyond aggregate rates: confusion matrices give an in-depth view of disagreement patterns, revealing different systematic biases in humans and AI. Qualitative assessment further indicates that while LLMs effectively surface human omissions, their disagreements are more often “unhelpful” than those between humans. Ultimately, this work provides a methodological foundation for integrating LLMs as collaborative agents that enhance data reliability, rather than simply automating annotation at scale.
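
The comparison described above rests on two simple quantities: a pairwise disagreement rate and a confusion matrix of label pairs. The following is a minimal sketch of how they could be computed, not the authors' implementation. The function names, the label values ("awareness", "consideration", "purchase"), and the toy data are all hypothetical; the summary does not list the actual Funnel stage or Intent categories.

```python
from collections import Counter


def disagreement_rate(labels_a, labels_b):
    """Fraction of items on which two annotators assign different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have equal length")
    if not labels_a:
        raise ValueError("need at least one item")
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return differing / len(labels_a)


def confusion_counts(labels_a, labels_b):
    """Count (label_a, label_b) pairs; pairs with a != b are disagreements."""
    return Counter(zip(labels_a, labels_b))


# Toy data with hypothetical labels (not taken from the study).
human_1 = ["awareness", "consideration", "purchase", "awareness"]
human_2 = ["awareness", "purchase", "purchase", "consideration"]
llm = ["awareness", "consideration", "consideration", "awareness"]

hh_rate = disagreement_rate(human_1, human_2)  # human-human baseline
hl_rate = disagreement_rate(human_1, llm)      # human-LLM disagreement
hl_confusion = confusion_counts(human_1, llm)  # which labels get confused

print(f"HH: {hh_rate:.2f}, HL: {hl_rate:.2f}")
for (human_label, llm_label), count in sorted(hl_confusion.items()):
    print(f"{human_label} -> {llm_label}: {count}")
```

Reading the off-diagonal cells of such a table (pairs where the two labels differ) shows which categories one annotator systematically swaps for another, which an aggregate rate alone hides.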
