What question did this study set out to answer?

This research aims to systematically benchmark large language models for 5W1H information extraction from Spanish news articles.

February 5, 2026 | Open Access

Benchmarking LLM-as-a-Judge Models for 5W1H Extraction Evaluation

Key Points

Benchmarking multiple large language models, including GPT, Claude, and Gemini, as evaluators of 5W1H extractions.
Evaluating models based on six quality criteria: Factual Accuracy, Completeness, Relevance and Conciseness, Clarity and Readability, Faithfulness to Source, and Overall Coherence.
Analyzing inter-judge agreement and score distribution patterns across two Spanish-language corpora.
Conducting meta-evaluation with human experts to compare LLM evaluations against journalistic judgment.
All models achieved alignment levels above 90% across all metrics.
Claude Sonnet 4.5 was identified as the most accurate evaluator with a Global Judgment Acceptance Rate of 99.79%.
Meta-evaluation with human experts showed substantial inter-annotator agreement (κ = 0.6739).
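
To make the two headline figures easier to interpret, here is a minimal Python sketch of how an acceptance rate and a kappa statistic can be computed. It rests on assumptions not stated on this page: that annotators give binary "accept"/"reject" labels, that the acceptance rate is the share of judgments labeled "accept", and that κ is Cohen's kappa between two annotators. The function names and the sample labels are hypothetical, and the paper's actual definitions may differ.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumption: the page does not say which kappa variant the study used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def acceptance_rate(labels):
    """Share of judgments marked 'accept' (hypothetical reading of JAR)."""
    return sum(1 for label in labels if label == "accept") / len(labels)


# Hypothetical labels from two human experts reviewing ten LLM judgments.
expert_1 = ["accept"] * 8 + ["reject"] * 2
expert_2 = ["accept"] * 7 + ["reject"] * 3

print(f"kappa = {cohen_kappa(expert_1, expert_2):.4f}")
print(f"acceptance rate (expert 1) = {acceptance_rate(expert_1):.2%}")
```

On the widely used Landis and Koch scale, kappa values between 0.61 and 0.80 are labeled "substantial", which matches the wording used for κ = 0.6739.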

Abstract

Evaluating 5W1H (Who, What, When, Where, Why, and How) information extraction systems remains challenging, as traditional information retrieval metrics like ROUGE and BLEU fail to capture semantic accuracy and narrative coherence. The LLM-as-a-Judge paradigm offers a promising alternative, yet systematic comparisons of judge models for this task are lacking. This study benchmarks multiple large language models, including state-of-the-art models such as GPT, Claude, and Gemini, as evaluators of 5W1H extractions from Spanish news articles. We assess judge performance across six quality criteria: Factual Accuracy, Completeness, Relevance and Conciseness, Clarity and Readability, Faithfulness to Source, and Overall Coherence. Our analysis examines inter-judge agreement, score distribution patterns, criterion-level variance, and the relationship between evaluation quality and computational cost. Using two Spanish-language corpora (BASSE and FLARES), we identify which criteria exhibit consistent cross-model agreement and which prove most sensitive to judge selection. The main contribution of this work is the first systematic benchmark of LLM-as-a-Judge models for 5W1H extraction evaluation in Spanish, validated against expert journalistic judgment. Results reveal that all evaluated models achieve alignment levels above 90% across all metrics. Specifically, Claude Sonnet 4.5 emerges as the most accurate evaluator, with a Global Judgment Acceptance Rate (JAR) of 99.79%. Furthermore, meta-evaluation with human experts demonstrates substantial inter-annotator agreement (κ = 0.6739). Finally, we provide recommendations for judge model selection based on task requirements and resource constraints, contributing practical guidance for researchers implementing LLM-based evaluation pipelines for information extraction tasks.
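
The criterion-level variance analysis mentioned in the abstract can be illustrated with a short Python sketch. Everything beyond the six criterion names is an assumption made for illustration: the judge names, the 1-to-5 scoring scale, the sample scores, and the use of population variance as the measure of sensitivity to judge selection. The paper may use a different procedure.

```python
from statistics import pvariance

# The six quality criteria listed in the abstract.
CRITERIA = [
    "Factual Accuracy",
    "Completeness",
    "Relevance and Conciseness",
    "Clarity and Readability",
    "Faithfulness to Source",
    "Overall Coherence",
]


def criterion_variance(scores_by_judge):
    """Variance of each criterion's score across judges for one extraction.

    scores_by_judge maps a judge name to a {criterion: score} dict.
    """
    return {
        criterion: pvariance(
            [scores[criterion] for scores in scores_by_judge.values()]
        )
        for criterion in CRITERIA
    }


def rank_by_sensitivity(scores_by_judge):
    """Criteria ordered from most to least sensitive to judge choice."""
    variances = criterion_variance(scores_by_judge)
    return sorted(variances, key=variances.get, reverse=True)


# Hypothetical scores (assumed 1-5 scale) from three placeholder judges.
example_scores = {
    "judge_a": {c: 4 for c in CRITERIA} | {"Completeness": 3},
    "judge_b": {c: 4 for c in CRITERIA} | {"Completeness": 5},
    "judge_c": {c: 4 for c in CRITERIA},
}

for criterion, variance in criterion_variance(example_scores).items():
    print(f"{criterion}: variance = {variance:.3f}")
print("Most judge-sensitive criterion:", rank_by_sensitivity(example_scores)[0])
```

In this toy data the judges disagree only on Completeness, so it is the only criterion with non-zero variance. A real analysis would aggregate such values over all articles in both corpora.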

Cite This Study

Cassola-Bacallao et al. studied this question.

synapsesocial.com/papers/698435aaf1d9ada3c1fb4bc7 https://doi.org/10.3390/electronics15030659