What question did this study set out to answer?

The aim is to investigate how reasoning capabilities transfer from RL teachers to SFT models and identify limitations in the process.

April 16, 2026 · Open Access

The Verifier Gap: Negative Scaling and Syntactic-Logical Divergence in Reasoning Distillation


Key Points

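Introduced Mid-Thought State Perturbation (MTSP), a dynamic evaluation protocol that injects adversarial arithmetic errors into a model's active reasoning trace.
Examined True RL Teachers (DeepSeek-R1, OpenAI o3-mini) and SFT-distilled models (Qwen, Llama) on the GSM8K benchmark.
Evaluated error recovery rates and reasoning accuracy across model sizes.
RL Teachers caught and recovered from injected errors up to 90.2% of the time, while SFT-distilled models frequently bypassed the corrupted logic and still produced the correct final answer.
A Multi-Family Negative Scaling Law was identified: the rate of unfaithful reasoning worsens as student models scale, reaching 47.8% for Qwen-14B and 44.8% for Llama-70B.
With the Chain-of-Thought ablated, accuracy of the larger models dropped to 33.9%, although the target answer was in their top-10 latent probabilities 94.1% of the time before reasoning began.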

Abstract

The rapid proliferation of reasoning-distilled large language models (LLMs) relies on the premise that Supervised Fine-Tuning (SFT) on the reasoning traces of Reinforcement Learning (RL) teachers transfers causal verification capabilities to smaller models. In this work, we empirically challenge this assumption. We introduce Mid-Thought State Perturbation (MTSP), a dynamic evaluation protocol that forcefully injects adversarial arithmetic errors directly into models' active reasoning traces. Evaluating across True RL Teachers (DeepSeek-R1, OpenAI o3-mini) and distilled SFT families (Qwen, Llama) on the GSM8K benchmark, we identify the Verifier Gap. While RL Teachers actively catch and recover from injected errors up to 90.2% of the time, SFT-distilled students frequently bypass corrupted logic to hallucinate the correct final answer. Crucially, we demonstrate a Multi-Family Negative Scaling Law: as student models scale, their rate of unfaithful reasoning paradoxically worsens, reaching 47.8% in Qwen-14B and 44.8% in Llama-70B (p < 10^-11). Through Contextual Amnesia, Logit Lens probing, and semantically void filler-token ablations, we explain this scaling failure via Syntactic-Logical Divergence. While target answers exist in the larger models' top-10 latent probabilities 94.1% of the time before reasoning begins, ablation of the Chain-of-Thought causes their accuracy to collapse to 33.9%. Our findings mechanically prove that SFT reasoning models decouple computation from logic, utilizing the explanatory trace not as a verified causal sequence, but as a performative "dummy scratchpad" to purchase sequence FLOPs without verifying intermediate logical steps. This exposes a fundamental limitation in current distillation paradigms, establishing that scaling SFT alone cannot safely replicate the internal verification mechanisms of true RL models.
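
As an illustration only: the abstract does not describe how MTSP injects errors or how outcomes are scored, so the following Python sketch shows one possible reading under stated assumptions. It corrupts the first integer arithmetic step in a partial reasoning trace, truncates the trace there so a model can continue from the corrupted state, and then labels the continuation with a simple heuristic. The function names (`inject_error`, `classify`), the fixed `offset`, the regular expression, and the three outcome labels are all hypothetical and are not taken from the paper.

```python
import re
from typing import Optional, Tuple

# Assumption: reasoning steps look like "12 * 4 = 48" (integers only).
STEP = re.compile(r"(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)")


def inject_error(trace: str, offset: int = 7) -> Optional[Tuple[str, str, str]]:
    """Corrupt the first arithmetic result in `trace`.

    Returns (perturbed_prefix, correct_value, wrong_value), where the prefix
    ends right after the corrupted number, or None if no step is found.
    """
    match = STEP.search(trace)
    if match is None:
        return None
    correct = match.group(4)
    wrong = str(int(correct) + offset)
    # Keep everything up to the result, then substitute the wrong value.
    prefix = trace[: match.start(4)] + wrong
    return prefix, correct, wrong


def classify(continuation: str, final_answer: str, gold: str, correct_value: str) -> str:
    """Heuristic outcome label for a model's continuation (hypothetical scheme).

    "propagated": the final answer is wrong.
    "recovered":  the final answer is right and the continuation restates
                  the correct intermediate value.
    "unfaithful": the final answer is right but the injected error is
                  never corrected in the visible trace.
    """
    if final_answer.strip() != gold.strip():
        return "propagated"
    if re.search(rf"\b{re.escape(correct_value)}\b", continuation):
        return "recovered"
    return "unfaithful"


trace = "Each box has 12 pens. 12 * 4 = 48 pens in total. 48 - 5 = 43."
prefix, correct, wrong = inject_error(trace)
print(prefix)  # the trace up to the first result, with 48 replaced by 55
print(classify("Wait, 12 * 4 is 48, not 55. 48 - 5 = 43.", "43", "43", correct))
print(classify("So 55 - 5 = 43.", "43", "43", correct))
```

Under this reading, the share of "unfaithful" labels over all perturbed problems would correspond to the unfaithful reasoning rate, and the share of "recovered" labels to the recovery rate. A real evaluation would need a more robust detector than a string match on the intermediate value.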


Authors

Ayush Anand

PES University
