What question did this study set out to answer?

This research examines the impact of capacity scaling on application accuracy in large language models and identifies systematic performance ceilings.

May 16, 2026 · Open Access


Capacity Scaling of Encoding-Through-Loading: Application vs. Cloze Asymmetry Across Three Orders of Magnitude

Tomas Pødenphant Lund, Aarhus University

Key Points

Investigated nine models across three substrates on the Zorbetik domain using frontloaded in-context learning.
Analyzed empirical data to document application accuracy and cloze retrieval performance.
Identified a substrate-universal race-architecture floor through comparison of results from different model architectures.
Application accuracy scales from 2% at 0.5B parameters to 85% at 70B parameters (Spearman ρ = +1.000 within the Qwen2.5 family; ρ = +0.92 across the nine-model cross-family panel).
Three different architectures converge on 85% application accuracy, suggesting a universal race-architecture floor.
Fine-tuning regime determines whether a model reaches this application floor, indicating that capacity alone does not dictate performance.
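
As a rough sanity check on the scaling figures above, the two reported endpoints (2% at 0.5B, 85% at 70B) can be converted into a slope in percentage points per decade of parameters. This is only a two-point sketch: the function name `pp_per_decade` is hypothetical, and the paper's own figure of +40.8 pp per decade presumably comes from a fit over more models, so the two numbers need not match.

```python
from math import log10


def pp_per_decade(size_lo_b: float, acc_lo_pct: float,
                  size_hi_b: float, acc_hi_pct: float) -> float:
    """Accuracy gain in percentage points per tenfold increase in parameters.

    Hypothetical helper: uses only two (size, accuracy) endpoints, so it is
    a crude stand-in for a regression over a full model panel.
    """
    if size_lo_b <= 0 or size_hi_b <= size_lo_b:
        raise ValueError("sizes must be positive and increasing")
    decades = log10(size_hi_b / size_lo_b)  # number of tenfold steps spanned
    return (acc_hi_pct - acc_lo_pct) / decades


# Endpoints as reported in the key points: 2% at 0.5B, 85% at 70B.
decades_spanned = log10(70.0 / 0.5)               # about 2.15 decades
slope = pp_per_decade(0.5, 2.0, 70.0, 85.0)       # about 38.7 pp per decade
print(f"{decades_spanned:.2f} decades, {slope:.1f} pp per decade")
```

The endpoint estimate lands in the same range as the reported slope, which is all a two-point calculation can be expected to show.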

Abstract

Paper 2 v3 in the friction-theory paper series. This version is a substantive revision of v2 (DOI 10.5281/zenodo.20127860) with one substantial new section (§2.5b) and cascading updates to the abstract, §2.6, §5.1, §5.4, §6, and the bibliography. The capacity-scaling substance of v1/v2 (Findings 1-3) is unchanged and is reproduced word-for-word in §2.1-§2.5; v3 adds Finding 4, documenting empirical convergence to a substrate-universal race-architecture floor at 85 % application accuracy.

What is new in v3:

§2.5b (new, ~900 words): Race-architecture floor, empirical convergence to 85 % application across substrates. Three substrates from different organisations, Llama-3.3-70B-Instruct (Meta, dense), DeepSeek-V3 (DeepSeek, MoE), and Cogito-V2.1-671B (DeepCogito, IDA-distilled MoE), converge to the same 85 % application accuracy on the Zorbetik domain despite different architectures and fine-tuning paradigms. Two additional 405 B-class fine-tunes (Llama-3.1-405B-Instruct, older Meta Instruct; Hermes-4-405B, Nous IDA) diverge below the floor for paradigm-specific reasons, confirming that fine-tuning regime, not capacity alone, determines whether a substrate reaches the floor. The 85 % floor is interpreted as the empirical manifestation of a race-architecture floor on the application task (Paper 1 §2.5; Paper 10 §1.5): capacity buys depth-tolerance, not depth-immunity.

Structural-impossibility analogy. §2.5b makes the race-architecture floor explicit as a structural upper bound, not a technological one, analogous to thermodynamic ceilings (Carnot limit on heat-engine efficiency). No substrate operating under R1 (parallel candidate routes) can achieve P(correct) = 1 on a non-trivial task regardless of scale, training, or fine-tuning regime. Sub-floor error rates on similarly structured tasks require architecture above the substrate (multi-model ensembling, formal verification layers, human-in-the-loop pipelines), not larger substrates. Ensembling lowers the aggregate error rate but does not abolish the floor.

Abstract updated with Finding 4 (race-architecture floor + 4-substrate convergence).

§2.6 Theoretical interpretation reformulated to address both the cloze floor (~90 %) and the application floor (~85 %) as substrate-universal race-architecture floors; "intelligence headroom" refined to be bounded above by the application floor, not by 100 % accuracy.

§5.1 What this paper establishes extended from three findings + methodological recommendation to four findings + methodological recommendation; the new Finding 4 captures the substrate-universal race-architecture floor result.

§5.4 Future work adds a new first bullet introducing Paper 2C (in preparation) as the chain-depth-axis companion (RACE-50 benchmark on a 327-substance invented domain with algorithmically validated DAG depth 64), testing whether race-architecture-floor manifestation on the chain-depth axis is substrate-universal in curve shape across model capacity scales.

§6 Conclusion updated to four findings; cross-cite to Paper 2C added.

Paper 2B cross-cite paragraph added in §2.5b, connecting winning-route-amplification (Paper 2B, substrate-mechanism companion) to the empirical floor: both findings point to a common substrate mechanism beneath the ICL/FT distinction and capacity-floor convergence.

§8 References: Paper 2C entry added (Pødenphant Lund 2026Y, in preparation).

What is unchanged from v2: all of §2.1 Design, §2.2-§2.5 Findings 1-3 (monotonic application scaling, bottleneck migration, MoE active-parameter scaling), §2.7-§2.10 (Yerkes-Dodson, first-token friction, caveats, somatic markers as field-layer prerequisite for elaborative encoding), §3 (Methodological note on frontloaded ICL), §4 (Scope note on encoding-battery / Paper 4), §5.2 (Implications for C-dimension), §5.3 (Limitations). These sections are reproduced word-for-word from v2.

Abstract. Large language models solve two distinguishable task types on the same underlying knowledge base. Cloze retrieval saturates early (~90 % by 8 B parameters); application scales monotonically across three orders of magnitude (2 % at 0.5 B to 85 % at 70 B). We document this asymmetry on a single invented knowledge domain ("Zorbetik") across nine models, using frontloaded in-context learning (Brown et al. 2020) to expose encoding-to-retrieval dynamics without weight updates. Four findings: (1) Application scales monotonically with capacity (Spearman ρ = +1.000 on Qwen2.5; cross-family panel ρ = +0.92, n=9; slope +40.8 pp per decade); (2) The bottleneck migrates with capacity: at 0.5 B retrieval fails, while at 14 B 36 % of questions show "retrieval succeeds, derivation fails"; (3) Mixture-of-Experts models scale on active parameters, not total; (4) NEW IN V3: Three substrates from different organisations converge to the same 85 % application accuracy, identifying a substrate-universal race-architecture floor (Paper 1 §2.5; Paper 10 §1.5). Capacity buys depth-tolerance, not depth-immunity. We recommend frontloaded ICL, in place of fine-tuning, as the operational instrument for encoding-retrieval studies of this kind. A companion paper (Paper 2C, in preparation) develops a controlled chain-depth benchmark testing whether the same floor manifests as substrate-universal depth-degradation across model capacity scales.

Companion papers in the series:
Paper 0 (BFT): 10.5281/zenodo.19462500
Paper 1 (FT generalised): 10.5281/zenodo.20012655
Paper 2B (substrate-mechanism companion, in preparation)
Paper 2C (chain-depth axis companion, in preparation)
Paper 3 (Friction-guided inference): 10.5281/zenodo.20014122
Paper 10 (Race-architecture, physics scope): 10.5281/zenodo.20014568

Data, fine-tuning notebooks, analysis scripts: https://github.com/tplund/friction-theory-p2-capacity-scaling
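
The abstract's remark that ensembling lowers the aggregate error rate without abolishing the floor can be illustrated with a toy majority-vote calculation. The sketch below is not the paper's analysis: it assumes each voter is independently correct with probability 0.85 (the reported application floor) and that answers are simply right or wrong. Independence is an idealisation introduced here; substrates that share failure modes would gain less from voting. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n voters is correct.

    Assumes (for illustration only) that voters err independently and that
    each is correct with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1 or n % 2 == 0:
        raise ValueError("use a positive odd number of voters to avoid ties")
    need = n // 2 + 1  # smallest number of correct votes that wins
    # Binomial tail: sum over all outcomes with at least `need` correct votes.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(need, n + 1))


# Accuracy for small ensembles of voters at the 85 % floor.
ensemble = {n: majority_vote_accuracy(0.85, n) for n in (1, 3, 5)}
for n, acc in ensemble.items():
    print(f"n={n}: {acc:.4f}")
```

Under these assumptions the aggregate accuracy rises with ensemble size (about 0.939 for three voters) yet stays below 1 for every finite ensemble, which matches the qualitative claim that ensembling reduces error without removing the floor.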


Cite This Study

Tomas Pødenphant Lund studied this question.

synapsesocial.com/papers/6a0809f1a487c87a6a40bd9c
https://doi.org/10.5281/zenodo.20187513
