The rapid evolution of Large Language Models (LLMs) has exposed the limitations of static, accuracy-oriented benchmarks and increased the need for evaluation frameworks that account for both the capabilities being measured and the quality of the benchmarks themselves. This survey analyzes 63 LLM benchmarks spanning 2012–2026 and organizes them into a taxonomy of six capability dimensions and 20 operational subcategories. We also propose the Benchmark Quality Assurance Index (BQAI), an AHP-weighted composite framework for assessing the scientific quality of benchmarks across seven dimensions: annotation, clarity, standardization, reproducibility, robustness, coverage, and fairness. The BQAI is applied to 30 representative benchmarks (48% of the 63-benchmark corpus) using blinded scoring by three evaluators, formal inter-rater reliability validation (ICC(2,k) and quadratic-weighted Cohen’s κ), and Monte Carlo sensitivity analysis (n = 1000 trials, ±10% to ±50% weight perturbation). In addition, we synthesize public performance results for 16 models across 10 benchmarks to examine saturation trends and reporting gaps. The analysis indicates that benchmark usefulness varies substantially across evaluation settings, that several established benchmarks are becoming less discriminative for frontier models, and that important gaps remain in safety, agentic, and cross-cultural assessment. Together, the taxonomy, the BQAI, and the saturation analysis provide a structured perspective on the current LLM benchmark landscape and on priorities for more rigorous evaluation.
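The Monte Carlo sensitivity analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the perturbation scheme (uniform scaling of each weight followed by renormalization), the stability measure (how often the top-ranked benchmark stays on top), the weight values, the 1 to 5 scoring scale, and the benchmark names and scores are all assumptions made up for illustration.

```python
import random

# Hypothetical weights for seven quality dimensions (placeholders, not the
# paper's AHP-derived weights); they are chosen to sum to 1.
weights = [0.20, 0.15, 0.15, 0.15, 0.15, 0.10, 0.10]

# Hypothetical per-dimension scores on an assumed 1-5 scale.
benchmarks = {
    "bench_a": [5, 4, 4, 5, 4, 4, 5],
    "bench_b": [3, 3, 4, 3, 3, 2, 3],
    "bench_c": [4, 4, 3, 4, 5, 3, 4],
}


def composite(scores, w):
    """Weighted sum of per-dimension scores."""
    return sum(s * wi for s, wi in zip(scores, w))


def perturb(w, magnitude, rng):
    """Scale each weight by a factor drawn uniformly from
    [1 - magnitude, 1 + magnitude], then renormalize to sum to 1."""
    scaled = [wi * (1 + rng.uniform(-magnitude, magnitude)) for wi in w]
    total = sum(scaled)
    return [wi / total for wi in scaled]


def rank(items, w):
    """Benchmark names ordered from highest to lowest composite score."""
    return sorted(items, key=lambda name: composite(items[name], w), reverse=True)


def top_rank_stability(items, w, magnitude, trials=1000, seed=0):
    """Fraction of trials in which the top-ranked benchmark is unchanged."""
    rng = random.Random(seed)
    baseline_top = rank(items, w)[0]
    unchanged = sum(
        rank(items, perturb(w, magnitude, rng))[0] == baseline_top
        for _ in range(trials)
    )
    return unchanged / trials


# Sweep the perturbation range named in the abstract (±10% to ±50%).
for magnitude in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(magnitude, top_rank_stability(benchmarks, weights, magnitude))
```

A stability value close to 1 at every perturbation level would suggest that the ranking does not hinge on the exact weights; other summary statistics, such as rank correlation with the baseline ordering, could be substituted for the top-rank check.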
Rubén Gómez
Instituto Politécnico Nacional
Carlos E. Miranda
Autonomous University of Queretaro
Julio-Alejandro Romero-González
Autonomous University of Queretaro
Machine Learning and Knowledge Extraction
Universidad Nacional Autónoma de México
Instituto Politécnico Nacional
Autonomous University of Queretaro
Gómez et al. (Fri,) studied this question.
synapsesocial.com/papers/6a12962948a0ea1665672b5b — DOI: https://doi.org/10.3390/make8060141