What question did this study set out to answer?

This study asks how existing large language model benchmarks can be organized by the capabilities they measure and assessed for scientific quality within a structured framework.

May 24, 2026

Open Access

Large Language Model Benchmarks: A Taxonomy of Capabilities, Scientific Quality Assessment, and Saturation Analysis

Key Points

The survey analyzes 63 LLM benchmarks published between 2012 and 2026 and categorizes them into six capability dimensions and 20 subcategories.
It introduces the Benchmark Quality Assurance Index (BQAI) and applies it to 30 representative benchmarks across seven scientific quality dimensions.
It synthesizes public performance results for 16 models across 10 benchmarks to identify saturation trends and reporting gaps.
Benchmark usefulness varies substantially across evaluation settings, and several established benchmarks are becoming less discriminative for frontier models.
Important gaps remain in safety, agentic, and cross-cultural assessment.
Inter-rater reliability is formally validated with ICC(2,k) and quadratic-weighted Cohen’s κ.
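
As a rough illustration of the agreement statistic named in the last key point, the sketch below computes quadratic-weighted Cohen's κ for two raters. It is not the authors' code: the function name, the integer rating scale, and the choice to raise an error when κ is undefined are all assumptions made for this example.

```python
from collections import Counter


def quadratic_weighted_kappa(a, b, n_levels):
    """Quadratic-weighted Cohen's kappa for two raters.

    a, b: equal-length lists of integer ratings in range(n_levels).
    The rating scale is a hypothetical choice for this sketch.
    """
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    if n_levels < 2:
        raise ValueError("need at least two rating levels")
    n = len(a)
    observed = Counter(zip(a, b))  # joint counts of (rater A, rater B)
    row = Counter(a)               # marginal counts for rater A
    col = Counter(b)               # marginal counts for rater B
    scale = (n_levels - 1) ** 2
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / scale  # quadratic penalty, 0 on the diagonal
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * (row[i] / n) * (col[j] / n)
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1.0 - observed_disagreement / expected_disagreement
```

Identical rating lists give κ = 1, and ratings that agree only at the chance rate give κ = 0. The quadratic weights penalize a two-level disagreement four times as heavily as a one-level disagreement.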

Abstract

The rapid evolution of Large Language Models (LLMs) has exposed limitations of static, accuracy-oriented benchmarks and increased the need for evaluation frameworks that distinguish among capabilities and benchmark quality. This survey analyzes 63 LLM benchmarks spanning 2012–2026 and organizes them into a taxonomy of six capability dimensions and 20 operational subcategories. We also propose the Benchmark Quality Assurance Index (BQAI), an AHP-weighted composite framework for assessing the scientific quality of benchmarks across seven dimensions related to annotation, clarity, standardization, reproducibility, robustness, coverage, and fairness. The BQAI is applied to 30 representative benchmarks, corresponding to 48% of the 63-benchmark corpus, with three-evaluator blinded scoring, formal inter-rater reliability validation (ICC(2,k) and quadratic-weighted Cohen’s κ), and Monte Carlo sensitivity analysis (n = 1000 trials, ±10% to ±50% weight perturbation). In addition, we synthesize public performance results for 16 models across 10 benchmarks to examine saturation trends and reporting gaps. The analysis indicates that benchmark usefulness varies substantially across evaluation settings, that several established benchmarks are becoming less discriminative for frontier models, and that important gaps remain in safety, agentic, and cross-cultural assessment. Together, the taxonomy, BQAI, and saturation analysis provide a structured perspective on the current LLM benchmark landscape and on priorities for more rigorous evaluation.
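
The abstract does not spell out how the composite or the sensitivity analysis is computed, so the following is only a minimal sketch of one plausible reading: a weighted sum over the seven dimensions, with each weight scaled by a random factor in [1 − δ, 1 + δ] and then renormalized. The equal weights, the toy scores, the benchmark names `bench_A` and `bench_B`, and the top-rank stability measure are all hypothetical and are not taken from the paper.

```python
import random

# Dimension names follow the abstract; everything else here is illustrative.
DIMENSIONS = ["annotation", "clarity", "standardization", "reproducibility",
              "robustness", "coverage", "fairness"]


def composite(scores, weights):
    """Weighted sum of per-dimension scores (assumed form of the index)."""
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


def perturb(weights, delta, rng):
    """Scale each weight by a factor in [1 - delta, 1 + delta], then renormalize."""
    raw = {d: w * rng.uniform(1 - delta, 1 + delta) for d, w in weights.items()}
    total = sum(raw.values())
    return {d: w / total for d, w in raw.items()}


def top_rank_stability(benchmarks, weights, delta, trials=1000, seed=0):
    """Fraction of trials in which the top-ranked benchmark is unchanged."""
    rng = random.Random(seed)
    baseline = max(benchmarks, key=lambda b: composite(benchmarks[b], weights))
    kept = 0
    for _ in range(trials):
        w = perturb(weights, delta, rng)
        top = max(benchmarks, key=lambda b: composite(benchmarks[b], w))
        kept += top == baseline
    return kept / trials


# Toy inputs, NOT data from the paper: placeholder equal weights and scores.
equal_weights = {d: 1 / len(DIMENSIONS) for d in DIMENSIONS}
toy = {
    "bench_A": {d: 4.0 for d in DIMENSIONS},
    "bench_B": {d: 2.0 for d in DIMENSIONS},
}
print(top_rank_stability(toy, equal_weights, delta=0.5))  # 1.0: bench_A dominates on every dimension
```

In practice the placeholder weights would be replaced by weights derived from an AHP pairwise-comparison matrix, and the stability measure could be swapped for a rank correlation across the full list of benchmarks.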

Authors

Rubén Gómez

Instituto Politécnico Nacional

Carlos E. Miranda

Autonomous University of Queretaro

Julio-Alejandro Romero-González

Autonomous University of Queretaro

Journals

Machine Learning and Knowledge Extraction

Institutions

Universidad Nacional Autónoma de México

Instituto Politécnico Nacional

Autonomous University of Queretaro
