What question did this study set out to answer?

This study empirically tests the implicit guarantee that when a reasoning model reasons its way to a refusal, its output will be safe.

June 1, 2026 · Open Access

Boundary-Targeted Activation Steering: A Diagnostic for Thought-Output Coupling in Reasoning Models

Key points

Introduced BoundarySteeringHook (BSH) to inject a compliance direction only after the </think> delimiter.
Analyzed reasoning outputs of DeepSeek-R1-Distill-Qwen-1.5B to evaluate thought-output dissociation.
Manually annotated refusal-concluding trajectories for compliance assessment.
D_dis,ann = 0: zero confirmed dissociations out of 29 refusal-concluding trajectories.
All flagged candidates were false positives, confirming no bypass of refusal conclusions.
Thought-output coupling holds under boundary injection at tested parameters.
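
The boundary-targeted injection named in the key points can be illustrated with a framework-free sketch of the gating logic alone. This is not the authors' BoundarySteeringHook: `BoundaryGate`, `run`, the `alpha` strength, and the choice to steer only positions strictly after the delimiter token are all assumptions made for illustration, and the compliance direction is assumed to be precomputed.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Sequence


@dataclass
class BoundaryGate:
    """Hypothetical helper: steer hidden states only after the delimiter."""

    delimiter_id: int            # token id of "</think>" (assumed to be one token)
    direction: Sequence[float]   # compliance direction, assumed precomputed
    alpha: float = 1.0           # steering strength (illustrative value)
    active: bool = field(default=False, init=False)

    def observe(self, token_id: int) -> None:
        # Once the delimiter has been seen, steering stays switched on.
        if token_id == self.delimiter_id:
            self.active = True

    def steer(self, hidden: Sequence[float]) -> List[float]:
        # Before the boundary the hidden state passes through unchanged.
        if not self.active:
            return list(hidden)
        return [h + self.alpha * d for h, d in zip(hidden, self.direction)]


def run(gate: BoundaryGate,
        token_ids: Iterable[int],
        hiddens: Iterable[Sequence[float]]) -> List[List[float]]:
    """Apply the gate position by position, as a decoding loop would."""
    out = []
    for tok, h in zip(token_ids, hiddens):
        out.append(gate.steer(h))  # steered only if the delimiter came earlier
        gate.observe(tok)          # then register the token at this position
    return out


gate = BoundaryGate(delimiter_id=9, direction=[1.0, -1.0], alpha=2.0)
steered = run(gate, [3, 9, 4, 5], [[0.0, 0.0]] * 4)
```

In a PyTorch model, logic of this kind would typically sit inside a function registered with `torch.nn.Module.register_forward_hook`. Which layer to hook and what strength to use are not stated in this summary.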

Abstract

Reasoning model safety rests on an implicit guarantee: if the model reasons its way to a refusal, the output will be safe. This guarantee has never been empirically tested at the mechanistic level. We provide the first tools to test it directly and report what those tools reveal. Our primary finding is methodological. The standard tool for studying compliance in reasoning models, uniform activation steering applied throughout the entire generation, is structurally incapable of testing this guarantee: it abolishes the </think> delimiter in 57.7% of refusal cases before a refusal conclusion can form. Prior work using uniform steering has therefore been measuring whether reasoning can be disrupted, not whether a completed refusal conclusion can be bypassed. To fill this gap, we introduce BoundarySteeringHook (BSH): a forward hook that injects a compliance direction exclusively after the </think> delimiter, leaving reasoning untouched. Paired with the thought-output dissociation score D_dis, it provides a reusable diagnostic for any delimiter-bounded reasoning model. Applying the tool to DeepSeek-R1-Distill-Qwen-1.5B and manually annotating all flagged cases, we find D_dis,ann = 0: zero confirmed dissociations out of 29 refusal-concluding trajectories. The judge flagged 6 candidates; all 6 are false positives. On this model at the tested parameters, thought-output coupling holds under boundary injection. The contribution is the measurement framework and the methodology for using it honestly.
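
The counts reported above can be turned into a small worked calculation. The abstract does not define the dissociation score, so the sketch below assumes it is the share of refusal-concluding trajectories whose final output complies. The function name `dissociation_score` and that definition are hypothetical; only the counts 6, 0, and 29 come from the text.

```python
def dissociation_score(n_dissociated: int, n_refusal_concluding: int) -> float:
    """Assumed definition: dissociated trajectories / refusal-concluding ones."""
    if n_refusal_concluding <= 0:
        raise ValueError("need at least one refusal-concluding trajectory")
    if not 0 <= n_dissociated <= n_refusal_concluding:
        raise ValueError("dissociated count must lie between 0 and the total")
    return n_dissociated / n_refusal_concluding


TOTAL_REFUSAL_CONCLUDING = 29   # from the abstract
JUDGE_FLAGGED = 6               # candidates flagged by the judge
CONFIRMED_BY_ANNOTATION = 0     # all 6 were false positives

# Score implied by the judge alone, under the assumed definition (6 / 29).
d_dis_judge = dissociation_score(JUDGE_FLAGGED, TOTAL_REFUSAL_CONCLUDING)

# Score after manual annotation, matching the reported D_dis,ann = 0.
d_dis_ann = dissociation_score(CONFIRMED_BY_ANNOTATION, TOTAL_REFUSAL_CONCLUDING)
```

Under this reading, relying on the judge alone would give a nonzero score, which is why the manual annotation step matters to the reported result.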
