The rapid evolution of Large Language Models (LLMs) has exposed limitations of static, accuracy-oriented benchmarks and increased the need for evaluation frameworks that distinguish among capabilities and benchmark quality. This survey analyzes 63 LLM benchmarks spanning 2012–2026 and organizes them into a taxonomy of six capability dimensions and 20 operational subcategories. We also propose the Benchmark Quality Assurance Index (BQAI), an AHP-weighted composite framework for assessing the scientific quality of benchmarks across seven dimensions related to annotation, clarity, standardization, reproducibility, robustness, coverage, and fairness. The BQAI is applied to 30 representative benchmarks, corresponding to 48% of the 63-benchmark corpus, with three-evaluator blinded scoring, formal inter-rater reliability validation (ICC(2,k) and quadratic-weighted Cohen’s κ), and Monte Carlo sensitivity analysis (n = 1000 trials, ±10% to ±50% weight perturbation). In addition, we synthesize public performance results for 16 models across 10 benchmarks to examine saturation trends and reporting gaps. The analysis indicates that benchmark usefulness varies substantially across evaluation settings, that several established benchmarks are becoming less discriminative for frontier models, and that important gaps remain in safety, agentic, and cross-cultural assessment. Together, the taxonomy, BQAI, and saturation analysis provide a structured perspective on the current LLM benchmark landscape and on priorities for more rigorous evaluation.
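As a rough illustration of how an AHP-weighted composite score and a Monte Carlo weight-perturbation check might fit together, the sketch below derives weights from a pairwise comparison matrix and re-scores one benchmark under randomly perturbed weights. It is not the authors' implementation: the principal-eigenvector weighting, the uniform multiplicative perturbation, the function names, and every numeric value are assumptions made for illustration; only the seven dimension names, the 1000 trials, and the ±10% to ±50% range come from the abstract.

```python
import numpy as np

# The seven BQAI dimensions named in the abstract.
DIMENSIONS = ["annotation", "clarity", "standardization", "reproducibility",
              "robustness", "coverage", "fairness"]


def ahp_weights(pairwise):
    """AHP weights from a reciprocal pairwise comparison matrix.

    Assumption: the classic principal-eigenvector method is used; the
    abstract does not say which AHP variant the authors applied.
    """
    eigenvalues, eigenvectors = np.linalg.eig(pairwise)
    principal = np.abs(np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))]))
    return principal / principal.sum()


def composite(scores, weights):
    """Weighted sum of per-dimension scores (hypothetical BQAI-style composite)."""
    return float(np.dot(scores, weights))


def perturbed_composites(scores, weights, magnitude, n_trials=1000, seed=0):
    """Composite scores under random weight perturbation.

    Assumption: each weight is scaled by a factor drawn uniformly from
    [1 - magnitude, 1 + magnitude], then weights are renormalized to sum to 1.
    """
    rng = np.random.default_rng(seed)
    factors = rng.uniform(1 - magnitude, 1 + magnitude,
                          size=(n_trials, len(weights)))
    perturbed = weights * factors
    perturbed /= perturbed.sum(axis=1, keepdims=True)
    return perturbed @ scores


if __name__ == "__main__":
    # Hypothetical, perfectly consistent judgments built from made-up priorities.
    priorities = np.array([0.25, 0.20, 0.15, 0.15, 0.10, 0.10, 0.05])
    pairwise = priorities[:, None] / priorities[None, :]
    weights = ahp_weights(pairwise)

    # Hypothetical per-dimension scores for a single benchmark.
    scores = np.array([4.0, 3.0, 5.0, 2.0, 4.0, 3.0, 1.0])
    print("baseline composite:", composite(scores, weights))
    for magnitude in (0.1, 0.3, 0.5):
        trials = perturbed_composites(scores, weights, magnitude)
        print(f"±{int(magnitude * 100)}% perturbation: "
              f"min={trials.min():.3f}, max={trials.max():.3f}")
```

A narrow spread of composite scores across trials would suggest that a benchmark's score does not depend heavily on the exact weights chosen, which is the kind of robustness a sensitivity analysis is meant to probe.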
Gómez et al. (Fri,) studied this question.
Machine Learning and Knowledge Extraction
Universidad Nacional Autónoma de México
Instituto Politécnico Nacional
Autonomous University of Queretaro