What question did this study set out to answer?

This study examines the supervision tradeoff in post-training a vision-language model, comparing naive SFT, structural-signal ORPO, and surface-signal ORPO against the untrained Base model.

April 27, 2026 · Open Access

The Supervision Tradeoff: Format Scaffolds, Judgment Pleasing, and Anti-Calibration in Post-Training

Key Points

The study compares three preference-aware post-training regimes for a vision-language model: naive SFT, structural-signal ORPO, and surface-signal ORPO.
Outputs were jointly assessed by a deterministic verifier and a cross-lineage judge panel on a 953-prompt out-of-distribution corpus.
A pre-registered 3-seed confirmatory replication did not reproduce the original single-seed result.
The untrained Base model outperformed the fine-tuned variants on the judge panel by a significant margin (0.076 normalized Borda points).
Surface-signal ORPO produced anti-calibration: a Brier skill score of −0.451, i.e. predictions 45.1% worse than simply predicting the dataset base rate.
Evaluators disagreed on how to weigh completeness against correctness, which shows that "quality of reasoning" is not a single dimension.
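The judge-panel margin above is reported in normalized Borda points. The exact normalization is not spelled out on this page, so the following is only a minimal sketch of one common convention: with k candidates, a judge's top choice earns 1, the bottom choice earns 0, and scores are averaged over judges. The function name and the arm labels are hypothetical, and the rankings are illustrative, not data from the paper.

```python
def normalized_borda(rankings):
    """Average normalized Borda score per item.

    rankings: list of per-judge orderings, each listing every item once,
    best first. Assumed convention (not confirmed by the paper): position 0
    scores 1.0, the last position scores 0.0, linear in between.
    """
    items = set(rankings[0])
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items to rank")
    totals = {item: 0.0 for item in items}
    for ranking in rankings:
        if len(ranking) != k or set(ranking) != items:
            raise ValueError("every judge must rank the same items exactly once")
        for position, item in enumerate(ranking):
            totals[item] += (k - 1 - position) / (k - 1)
    return {item: total / len(rankings) for item, total in totals.items()}


# Illustrative rankings from four hypothetical judges over three hypothetical arms.
example_rankings = [
    ["arm_a", "arm_b", "arm_c"],
    ["arm_a", "arm_c", "arm_b"],
    ["arm_b", "arm_a", "arm_c"],
    ["arm_a", "arm_b", "arm_c"],
]
example_scores = normalized_borda(example_rankings)
```

Because every judge distributes the same total score, a single ordinal like this cannot show why judges disagreed, only that they did.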

Abstract

Target venue: NeurIPS 2026 Evaluations

[...] 0.232 from Base to the top structural-ORPO arm. The judges converge strongly on factual-accuracy ranking (average Spearman ρ ≈ 0.873) yet diverge on definitive-winner identity (Fleiss' κ ≈ 0.394): LLM-as-judge is reliable for objective dimensions but unstable for holistic-quality verdicts.

Format Compulsion (SFT-side mechanism): structural-signal SFT gains verifier accuracy at the cost of judge-panel rank, but only at single seed. The Gen-1-continue arm at single seed scored 34.6% on the verifier against Base's 32.0% (+2.6pp), while dropping to Borda-5th of 8. The verifier gain is benchmark-selective (HumanEval+ +7.0pp, MMLU-Pro +2.8pp, MATH-500 −1.0pp); the Borda loss is consistent with the model emitting long, structurally correct traces that frontier judges read as repetitive or pedantic. We name this format compulsion: structural template internalization that improves syntactic correctness on grounded benchmarks while penalizing the model on holistic preference axes that reward concision.

Calibration Compulsion and Epistemic Decoys (surface-ORPO mechanism): training the model with preference pairs that differ only by an appended confidence footer does not produce calibration. It produces anti-calibration: Brier 0.296 against an empirical-base-rate predictor's 0.204 on N=531 verifier-grounded emitted-footer prompts. The Brier skill score is −0.451, meaning the trained calibrator is 45.1% worse than predicting the dataset base rate; all three calibration-token-to-probability mappings (standard, hedged, extreme) yield negative skill scores. The footer is emitted task-conditionally (high on multiple-choice MMLU-Pro, low on long-form MATH) rather than knowledge-conditionally. We call this calibration compulsion: surface-signal ORPO learns to satisfy the format of calibration without grounding in internal task-state.
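The skill score quoted above follows from the two Brier scores by the standard definition BSS = 1 − BS_model / BS_reference. The sketch below reproduces that arithmetic and shows how a base-rate reference can be computed; the function names are illustrative and are not taken from the released codebase.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def base_rate_brier(outcomes):
    """Brier score of a predictor that always outputs the empirical base rate."""
    rate = sum(outcomes) / len(outcomes)
    return brier_score([rate] * len(outcomes), outcomes)


def brier_skill_score(model_brier, reference_brier):
    """Positive means better than the reference, negative means worse."""
    return 1.0 - model_brier / reference_brier


# Values as reported in the abstract: model 0.296, base-rate reference 0.204.
reported_bss = brier_skill_score(0.296, 0.204)
print(round(reported_bss, 3))  # -0.451
```

A negative score means the constant base-rate prediction would have been the better forecaster, which is the sense in which the footer is anti-calibrated.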
The trained confidence scores are epistemic decoys: an instance of the RLHF feedback loop in which judges rewarding confident-sounding language produce models that confidently misrepresent their own uncertainty.

Density walk-back (scope refinement of prior work): Group A (sparse captions) and Group B (dense Reasoning-NEST records, 11.1× per-record token density) are statistically indistinguishable on 953-prompt OOD verifier-grounded accuracy (32.9% vs 32.9%) and on 4-judge Borda (0.566 vs 0.575). At this corpus scale (459 admitted training records) and base size (11B), per-record density does not transfer to OOD reasoning. The Density Imperative's in-domain claims are not falsified; the OOD-transfer extrapolation is. Breadth of task coverage may matter more than per-record density at small corpus scales.

The Replication Crisis, Demonstrated (a new methodological standard): a pre-registered 3-seed confirmatory replication (seeds 42, 1337, 2026) of the fp16-matched SFT-vs-ORPO pair does not reproduce the single-seed result and inverts the sign of the paired contrast. The original Gen-1-continue 34.6% does not recover (3-seed range [20.9%, 23.3%], mean 22.19%, std 1.21pp). The paired SFT−ORPO contrast becomes −4.25pp in ORPO's favor on the verifier-grounded intersection (N=361 events identical across all six runs), with a 95% paired t-CI of [−8.93, +0.44] that crosses zero. Post-training verifier accuracy on this base exhibits ~11pp seed-to-seed variance on the SFT arm at matched hyperparameters, larger than any inter-regime difference the original paper took to be informative. The single-seed “SFT wins the verifier” claim is withdrawn in favor of the methodological finding itself: post-training claims at the 11B-parameter scale are statistically fragile under single-seed reporting.
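A paired t-interval over a handful of seeds is wide because the critical value is large at few degrees of freedom (about 4.303 for three seeds). The sketch below shows the generic computation under the assumption that pairing is done per seed; the function name is illustrative, and the accuracy values are hypothetical placeholders, not the paper's per-seed results.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, indexed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def paired_t_ci(first, second):
    """Return (mean difference, lower bound, upper bound) of a 95% paired t-CI."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    differences = [a - b for a, b in zip(first, second)]
    degrees_of_freedom = len(differences) - 1
    if degrees_of_freedom not in T_CRIT_95:
        raise ValueError("this sketch only tabulates 2 to 6 paired samples")
    mean_difference = statistics.mean(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(len(differences))
    half_width = T_CRIT_95[degrees_of_freedom] * standard_error
    return mean_difference, mean_difference - half_width, mean_difference + half_width


# Hypothetical per-seed verifier accuracies in percent (placeholders only).
hypothetical_sft = [20.0, 22.0, 24.0]
hypothetical_orpo = [25.0, 26.0, 27.0]
mean_diff, ci_low, ci_high = paired_t_ci(hypothetical_sft, hypothetical_orpo)
crosses_zero = ci_low <= 0.0 <= ci_high
```

With only three pairs, even a consistent per-seed gap needs a small spread to exclude zero, which is why single-seed contrasts carry so little evidential weight.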
Rubric-weighting disagreement specimen (§10 meta-evaluation): a single MATH-500 Level 5 reflection event in the 953-prompt sweep documents that three of four frontier judges (Claude, GPT, Grok) ranked the Base output first, although it was complete but arithmetically wrong, over Gen-1's truncated-but-geometrically-correct output. The fourth judge (Gemini) ranked Gen-1 first by weighting correctness-of-direction over completeness. Both rubric weightings are defensible; the disagreement is the methodological finding. “Quality of reasoning” is a multi-dimensional construct in which completeness, method-recognition, and correctness can be in tension. This finding is directly relevant to the LLM-as-judge position-bias and verbosity-bias literature, and to the design of any aggregate ranking that compresses multi-dimensional rubrics into a single ordinal.

Truncation audit and scope of confounded sub-findings: at the 256-token inference ceiling, HumanEval+ and MMLU-Pro carry <1.5% truncation across all arms (the +7.0pp SFT HumanEval+ gain and the −7.6pp ORPO HumanEval+ regression are un-confounded), while MATH-500 Level 5 truncation varies 25–49% across arms (Base 25.7%, SFT 25.7%, ORPO 6.4%, Group B 48.6%). MATH-500 sub-findings are scoped as confounded; a 200-prompt re-run at 1024 tokens on A100 is pre-registered for the revision cycle.

Evaluative integrity and E[...] This paper continues The Density Imperative (10.5281/zenodo.18667735) and Cognitive Nutrition (10.5281/zenodo.18667742). The reproducibility codebase, including all training, inference, judging, and analysis scripts, is released at github.com/codex-curator/golden-codex-pipeline at commit f9119b1 (initial public release). SHA-256 of v5.1 PDF: 6ff2c08693885f8c3d9578288a93e2defda2467152f2274d6798cca6443fe263. Hugging Face Hub release (paper + corpus + outputs + judgments + Round-2 seed replication, loadable via datasets.load_dataset): huggingface.co/datasets/Metavolve-Labs/supervision-tradeoff.
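A downloaded copy of the v5.1 PDF can be checked against the published digest with the standard library alone. This is a minimal sketch: the helper names are illustrative, and the local file path is whatever the reader saved the PDF as.

```python
import hashlib

# SHA-256 of the v5.1 PDF as published in the abstract.
EXPECTED_SHA256 = "6ff2c08693885f8c3d9578288a93e2defda2467152f2274d6798cca6443fe263"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True only if the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_published_hash` on the local PDF path returns `True` only for a byte-identical copy; any other version or a corrupted download returns `False`.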
