What question did this study set out to answer?

The study investigates how reasoning capabilities transfer from RL teachers to SFT-distilled models, and identifies the limitations of that transfer.

April 16, 2026 | Open Access

The Verifier Gap: Negative Scaling and Syntactic-Logical Divergence in Reasoning Distillation

Key points

Introduced Mid-Thought State Perturbation (MTSP) for dynamic evaluation of reasoning traces.
Examined True RL Teachers (DeepSeek-R1, OpenAI o3-mini) and SFT-distilled model families (Qwen, Llama) on the GSM8K benchmark.
Evaluated error recovery rates and reasoning accuracy across model sizes.
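RL Teachers caught and recovered from injected adversarial errors up to 90.2% of the time, while SFT-distilled students frequently bypassed the corrupted logic and still produced the correct final answer.
A Multi-Family Negative Scaling Law was identified: the rate of unfaithful reasoning worsens as student models scale, reaching 47.8% in Qwen-14B and 44.8% in Llama-70B.
In the larger models, the target answer was already in the top-10 latent probabilities 94.1% of the time before reasoning began, yet accuracy collapsed to 33.9% when the Chain-of-Thought was ablated.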
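
To make the perturbation idea in the key points concrete, here is a minimal sketch of how an arithmetic error might be injected into a partial reasoning trace. It is an illustration under stated assumptions, not the authors' MTSP implementation: the regular expression, the function name `inject_arithmetic_error`, the fixed `offset`, and the toy trace are all hypothetical.

```python
import re

# Hypothetical pattern: matches a simple integer step of the form "a op b = c".
STEP = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")


def inject_arithmetic_error(trace: str, offset: int = 7):
    """Corrupt the result of the first 'a op b = c' step and cut the trace there.

    Returns (perturbed_prefix, true_value, injected_value), or None when the
    trace contains no step the pattern recognises. The offset must be non-zero,
    otherwise nothing is actually corrupted.
    """
    if offset == 0:
        raise ValueError("offset must be non-zero")
    match = STEP.search(trace)
    if match is None:
        return None
    true_value = int(match.group(4))
    injected_value = true_value + offset
    # Keep the text up to the stated result, then substitute the wrong number.
    # The model under test would be asked to continue from this prefix.
    prefix = trace[: match.start(4)] + str(injected_value)
    return prefix, true_value, injected_value


# Toy trace written for this example (not taken from GSM8K).
trace = "Each box holds 12 eggs, so 3 * 12 = 36 eggs. Then 36 - 6 = 30 remain."
prefix, true_value, injected_value = inject_arithmetic_error(trace)
print(prefix)
```

The prefix ends immediately after the corrupted number, so whatever the model generates next shows whether it notices the inconsistency or simply carries on.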

Abstract

The rapid proliferation of reasoning-distilled large language models (LLMs) relies on the premise that Supervised Fine-Tuning (SFT) on the reasoning traces of Reinforcement Learning (RL) teachers transfers causal verification capabilities to smaller models. In this work, we empirically challenge this assumption. We introduce Mid-Thought State Perturbation (MTSP), a dynamic evaluation protocol that forcefully injects adversarial arithmetic errors directly into models' active reasoning traces. Evaluating across True RL Teachers (DeepSeek-R1, OpenAI o3-mini) and distilled SFT families (Qwen, Llama) on the GSM8K benchmark, we identify the Verifier Gap. While RL Teachers actively catch and recover from injected errors up to 90.2% of the time, SFT-distilled students frequently bypass corrupted logic to hallucinate the correct final answer. Crucially, we demonstrate a Multi-Family Negative Scaling Law: as student models scale, their rate of unfaithful reasoning paradoxically worsens, reaching 47.8% in Qwen-14B and 44.8% in Llama-70B (p < 10^-11). Through Contextual Amnesia, Logit Lens probing, and semantically void filler-token ablations, we explain this scaling failure via Syntactic-Logical Divergence. While target answers exist in the larger models' top-10 latent probabilities 94.1% of the time before reasoning begins, ablation of the Chain-of-Thought causes their accuracy to collapse to 33.9%. Our findings mechanically prove that SFT reasoning models decouple computation from logic, utilizing the explanatory trace not as a verified causal sequence, but as a performative "dummy scratchpad" to purchase sequence FLOPs without verifying intermediate logical steps. This exposes a fundamental limitation in current distillation paradigms, establishing that scaling SFT alone cannot safely replicate the internal verification mechanisms of true RL models.
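
The abstract distinguishes models that catch an injected error from models that skip past it and still state the right answer. The sketch below shows one way such runs could be labeled and tallied. It is a crude heuristic written for illustration only: the three label names, the rule "the continuation restates the true intermediate value" as a proxy for catching the error, and both function names are assumptions, not the paper's scoring procedure.

```python
import re
from collections import Counter

LABELS = ("recovered", "unfaithful", "propagated")


def classify_outcome(continuation: str, true_value: int,
                     final_answer: int, gold_answer: int) -> str:
    """Label one perturbed run (hypothetical scheme).

    propagated: the final answer is wrong, so the injected error carried through.
    recovered:  the final answer is right and the continuation restates the
                true intermediate value (a rough proxy for catching the error).
    unfaithful: the final answer is right although the corrupted step was
                never corrected in the visible trace.
    """
    if final_answer != gold_answer:
        return "propagated"
    restates_true_value = re.search(rf"\b{true_value}\b", continuation) is not None
    return "recovered" if restates_true_value else "unfaithful"


def outcome_rates(labels):
    """Return the fraction of runs carrying each label (zeros for an empty list)."""
    counts = Counter(labels)
    total = len(labels)
    return {name: (counts[name] / total if total else 0.0) for name in LABELS}


# Toy continuations written for this example; these are not results from the paper.
runs = [
    ("Wait, 3 * 12 is 36, not 43. So 36 - 6 = 30.", 36, 30, 30),
    ("So 43 - 6 = 30 eggs remain.", 36, 30, 30),
    ("So 43 - 6 = 37 eggs remain.", 36, 37, 30),
]
labels = [classify_outcome(*run) for run in runs]
rates = outcome_rates(labels)
```

A string match of this kind will mislabel some runs, for example when the true intermediate value happens to equal the final answer, so a real evaluation would need a more careful judge.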
