What question did this study set out to answer?

This study asks why fine-tuned language models are more overconfident than in-context-learning models given the same knowledge.

June 2, 2026 | Open Access

Per-Token Confidence Trajectory in LoRA Fine-Tuning: Substrate-Output Evidence for Cumulative Calibration Collapse

Key Points

Conducted three experiments on Qwen2.5 base models (3B and 7B) across LoRA, full-parameter, and paraphrase-augmented fine-tuning, with budgets from 5 to 100 epochs.
Measured per-token competing routes and output calibration during training across varying epochs.
Used a reactance test with format-violating prompts to evaluate model accuracy under different training regimes.
ICL outperforms LoRA-FT on cloze retrieval by 16–28 percentage points, and fine-tuning confers no accuracy benefit over the no-context baseline.
The per-token competing-routes signal collapses monotonically with cumulative gradient passes: log(CR) at position 0 climbs from 5.46 under ICL to 17.85 after 30 epochs of raw fine-tuning.
ICL maintains consistent accuracy (88%) versus variable drops in FT under matching conditions.
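
The key points refer to a per-token competing-routes measure and to output sharpness without defining either on this page. As a rough illustration only, the sketch below computes the entropy of a next-token distribution and a log ratio between the winning candidate and all alternatives. The function `log_winner_ratio` is a hypothetical stand-in and is not the paper's CR definition; the logit vectors are made-up values, not data from the study.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def log_winner_ratio(logits):
    """Hypothetical stand-in for a log competing-routes score.

    Returns log p(top) - log(sum of p(all other candidates)).
    The softmax normaliser cancels, so this is computed in logit space.
    """
    ordered = sorted(logits, reverse=True)
    top, rest = ordered[0], ordered[1:]
    m = max(rest)
    log_sum_rest = m + math.log(sum(math.exp(x - m) for x in rest))
    return top - log_sum_rest


# Made-up logits: one distribution with live alternatives, one sharply peaked.
spread = [2.0, 1.0, 0.5, 0.0]
peaked = [20.0, 1.0, 0.5, 0.0]

if __name__ == "__main__":
    for name, logits in (("spread", spread), ("peaked", peaked)):
        probs = softmax(logits)
        print(name, round(entropy(probs), 4), round(log_winner_ratio(logits), 4))
```

Under this stand-in, a peaked distribution has low entropy and a large log ratio at the same time, which matches the direction of the reported figures (entropy falling while log(CR) rises) without reproducing how they were measured.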

Abstract

Why do fine-tuned language models hallucinate more confidently than their in-context-learning counterparts on the same knowledge? Two well-established techniques add new knowledge to a language model: keeping it in the prompt, known as in-context learning (ICL; Brown et al. 2020), or training it into the weights, here via low-rank adapters (LoRA; Hu et al. 2021) or full-parameter fine-tuning (FT). The deployment-cost trade-off between these techniques is widely studied. Less well understood is why FT-trained models hallucinate more confidently than their ICL counterparts on the same knowledge, and what the substrate mechanism behind this difference implies for the debate over retrieval-augmented generation (RAG) versus fine-tuning that has dominated practitioner discussion.

Mechanism. We trace the difference to a substrate-output mechanism: each backward pass under cross-entropy loss amplifies the winning route asymmetrically and presses alternatives below the noise floor at the output layer; the depth of this compression scales with how many gradient passes the substrate has absorbed. ICL preserves the model's calibrated distribution over candidate answers because no weight update has compressed it. FT compresses it as a structural consequence of cumulative gradient pressure, regardless of training-data content. The per-token competing-routes (CR) signal we use as the primary observable makes this compression directly measurable at the substrate-output layer. The mechanism generalises a previously RLHF-specific finding (Pødenphant Lund 2026b §3) to all weight-update training, including plain LoRA fine-tuning on innocuous factual data and full-parameter FT.

Empirical findings. Across three experiments on a 47-fact invented knowledge domain (Zorbetik), Qwen2.5 base models at 3B and 7B scales, and fine-tuning budgets from 5 to 100 epochs across LoRA, full-parameter, and paraphrase-augmented variants, we document the mechanism. ICL outperforms LoRA-FT on cloze retrieval by 16–28 percentage points across capacity scales, a gap robust to evaluation-set sampling. FT-trained models confer no application-accuracy benefit over the no-context baseline despite encoding the facts. The per-token competing-routes signal collapses monotonically with cumulative gradient passes: log(CR_pos0) climbs from 5.46 in ICL to 17.85 after 30 epochs of raw FT to 21.12 after 30 epochs of paraphrase-augmented FT. Entropy at position 0 collapses from 0.32 (ICL) to ≈0.00 (any FT regime), independent of training-data variation. A compute-matched 2×2 control (LoRA-7B, 12 conditions, 3 seeds) disentangles data variation from cumulative gradient pressure: gradient pressure dominates as the driver of compression; paraphrase confers a measurable but smaller second-order effect. A full-parameter vs LoRA ablation on Qwen2.5-3B (30 runs) confirms the mechanism is not LoRA-specific: full-parameter FT shows the same monotonic CR collapse with comparable amplitude. A reactance test at n=300 per cell against format-violating prompts (Q paraphrase-FT drops 18 percentage points; raw-FT drops 42. Paraphrase-FT shows measurably less reactance than raw-FT on both axes, a third independent confirmation of the distributed-amplification reading. Among the 32 conditions that reach 88% cloze accuracy by 15 epochs, Expected Calibration Error settles in a tight 0.118–0.120 band while log(CR) ranges 11.56–19.29, a 7.7 log-unit spread that standard calibration metrics cannot see. The substrate-output observable extends the discriminative range past the accuracy-and-calibration plateau.

Applied consequences. Why FT-trained models hallucinate more confidently: alternative routes have been compressed beyond reach.
Why agentic systems can't reliably represent uncertainty when built on FT: the substrate-level signal that would carry uncertainty is gone. The substrate-level distinction this paper establishes offers a principled resolution of the RAG-vs-fine-tuning debate: RAG operates in ICL-mode and preserves calibration; FT compresses it as a structural consequence. The choice between them is architectural, not merely a deployment-cost trade-off. How long-context agentic conversations inherit ICL's calibration properties for free: each turn re-evaluates the full context with no weight update. And how a concrete hybrid ICL+FT architecture combines ICL's calibration-preservation with FT's persistence at bounded context-window cost.

Scope of empirical claims. Silicon substrates only: Qwen2.5 base models at 3B and 7B; LoRA and full-parameter FT; LoRA budgets 5–100 epochs plus paraphrase-augmented variant; one invented domain (Zorbetik); compute-matched 2×2 at 30 epochs; n=300 reactance test on the 7B substrate. Cross-substrate extensions (whether the architectural distinction maps onto biological memory systems) are out of scope here and developed separately.

Companion papers in the Friction Theory series:
Paper 0 (Behavioural Friction Theory): 10.5281/zenodo.19462500
Paper 1 (Friction Theory substrate): 10.5281/zenodo.20012655
Paper 2 (Capacity Scaling, the empirical companion): 10.5281/zenodo.20013491
Paper 3 (Friction-Guided Inference): 10.5281/zenodo.20014122
Paper 6 (Matched Friction Under Hysteresis): 10.5281/zenodo.20059863
Paper 10 (Race-Architecture Physics): 10.5281/zenodo.20014568
Paper 13 (Operational Friction Theory): 10.5281/zenodo.20059877
Paper 4, Paper 4B, Paper 14 (Logic as Reactance), cognitive-science companion: in preparation

Data and code. Per-token logprob datasets, fine-tuning notebooks, and analysis scripts share Paper 2's companion repository: https://github.com/tplund/friction-theory-p2-capacity-scaling (CC BY 4.0).

What's new in v3.1 (2026-05-31). Scope narrowed to silicon substrate (cognitive-science material moves to a dedicated companion paper, in preparation). 42 new training runs added across two compute-matched designs (paraphrase 2×2 and full-parameter vs LoRA ablation) plus an n=300 reactance test (replacing the v2 n=30 pilot). Eight textual fixes integrated from external dual hostile peer-review. Bibliography: Biderman et al. 2024 ("LoRA learns less and forgets less", TMLR) added. v2 record archived at 10.5281/zenodo.20187352.
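
The mechanism described in the abstract, where repeated cross-entropy updates amplify the winning route and compress the alternatives, can be illustrated with a deliberately small toy. The sketch below applies plain gradient steps directly to a four-entry logit vector, using the standard softmax cross-entropy gradient (probabilities minus the one-hot target). Treating logits as free parameters, the starting values, the learning rate, and the checkpoint counts are all assumptions made for illustration; this is not the paper's LoRA or full-parameter setup, and its numbers are not comparable to the reported ones.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cross_entropy_steps(logits, target, steps, lr=1.0):
    """Take `steps` gradient-descent steps on loss = -log softmax(z)[target].

    Toy assumption: the logits themselves are the trainable parameters.
    The gradient with respect to z_i is p_i - 1 if i is the target, else p_i.
    """
    z = list(logits)
    for _ in range(steps):
        p = softmax(z)
        z = [zi - lr * (pi - (1.0 if i == target else 0.0))
             for i, (zi, pi) in enumerate(zip(z, p))]
    return z


def trajectory(logits, target, checkpoints):
    """Return (steps, p(target), entropy) at each checkpoint."""
    rows = []
    for steps in checkpoints:
        p = softmax(cross_entropy_steps(logits, target, steps))
        rows.append((steps, p[target], entropy(p)))
    return rows


if __name__ == "__main__":
    # Illustrative starting logits; index 0 is the "winning route".
    for steps, p_target, h in trajectory([1.0, 0.5, 0.0, -0.5], 0, (0, 5, 30, 100)):
        print(f"steps={steps:3d}  p(target)={p_target:.4f}  entropy={h:.4f}")
```

In this toy the target probability rises and the entropy falls at every checkpoint, regardless of what the target is, which mirrors the abstract's claim that compression follows from cumulative gradient pressure. With no update steps the distribution is left untouched, a loose analogue of the ICL case in which no weights change.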


Cite This Study

Tomas Pødenphant Lund studied this question.

synapsesocial.com/papers/6a1e730830b38c64201b63f1
https://doi.org/10.5281/zenodo.20472479
