General-purpose large language model (LLM) benchmarks such as MMLU, HumanEval, and GPQA Diamond have entered a regime in which the leading models cluster within two to four percentage points of one another and in which substantial evidence of training-data contamination undermines the interpretability of the remaining gaps. In response, the evaluation ecosystem has produced a rapidly growing tail of niche benchmarks: domain-vertical evaluations such as MedQA, LegalBench, FinBen, BridgeBench, and PIF-Bench, and indie/informal evaluations such as SkateBench and SnitchBench. The dominant framing in the recent literature treats this proliferation as fragmentation to be cleaned up. We argue the opposite: niche benchmarks are a structurally necessary correction to the saturation, contamination, and Goodhart-style overfitting of general-capability leaderboards, and they generate value in two directions at once. For practitioners, they make model selection a tractable, evidence-based decision tied to a specific deployment context. For frontier laboratories, they provide a discriminating signal precisely where general benchmarks no longer separate models, and they surface failure modes (geographic recall gaps, temporal displacement of facts, edge-case adaptation, sensitive-topic handling) that aggregate accuracy metrics obscure. We propose a three-tier taxonomy of niche benchmarks (academic-formal, industry-vertical, and indie-informal), articulate seven design principles for constructing one, and present three short vignettes (SkateBench, SnitchBench, and PIF-Bench) that illustrate how benchmarks with radically different scopes and methodological styles each surface failure modes invisible to general evaluation.
We address the standard objections to benchmark proliferation (fragmentation, quality variance, evaluator conflicts, cultural bias) and argue that the appropriate response is not consolidation but better infrastructure for documentation (e.g., benchmark cards), portfolio-level evaluation, and meta-analysis across the tail.
Solomon Shalom Lijo (Sun,) studied this question.