Reasoning model safety rests on an implicit guarantee: if the model reasons its way to a refusal, the output will be safe. This guarantee has never been empirically tested at the mechanistic level. We provide the first tools to test it directly and report what those tools reveal. Our primary finding is methodological. The standard tool for studying compliance in reasoning models, uniform activation steering applied throughout the entire generation, is structurally incapable of testing this guarantee: it abolishes the delimiter that closes the reasoning span in 57.7% of refusal cases, before a refusal conclusion can form. Prior work using uniform steering has therefore been measuring whether reasoning can be disrupted, not whether a completed refusal conclusion can be bypassed. To fill this gap, we introduce BoundarySteeringHook (BSH), a forward hook that injects a compliance direction exclusively after the delimiter, leaving the reasoning untouched. Paired with the thought-output dissociation score Ddis, it provides a reusable diagnostic for any delimiter-bounded reasoning model. Applying the tool to DeepSeek-R1-Distill-Qwen-1.5B and manually annotating all flagged cases, we find Ddis,ann = 0: zero confirmed dissociations among 29 refusal-concluding trajectories. The judge flagged 6 candidates; all 6 are false positives. On this model at the tested parameters, thought-output coupling holds under boundary injection. The contribution is the measurement framework and the methodology for using it honestly.
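The difference between uniform steering and boundary injection can be illustrated with a minimal, framework-free sketch. It is not the BSH implementation: the function names `boundary_mask` and `steer`, the `</think>` delimiter string, the toy two-dimensional hidden states, and the scale `alpha` are all assumptions chosen for illustration. In a real model the same gating would presumably live inside a forward hook that tracks whether the delimiter has already been generated.

```python
from typing import List, Sequence

Vector = List[float]


def boundary_mask(tokens: Sequence[str], delimiter: str) -> List[bool]:
    """Mark positions strictly after the first delimiter.

    The delimiter position itself is left unmarked, and the mask is
    all False when the delimiter never appears.
    """
    mask: List[bool] = []
    seen = False
    for tok in tokens:
        mask.append(seen)
        if tok == delimiter:
            seen = True
    return mask


def steer(hidden: Sequence[Vector], direction: Vector,
          alpha: float, mask: Sequence[bool]) -> List[Vector]:
    """Add alpha * direction to each hidden vector whose mask entry is True."""
    return [
        [h + alpha * d for h, d in zip(vec, direction)] if m else list(vec)
        for vec, m in zip(hidden, mask)
    ]


# Toy trajectory: reasoning tokens, the (assumed) delimiter, then output tokens.
tokens = ["<think>", "unsafe", "refuse", "</think>", "Sorry", "."]
hidden = [[0.0, 0.0] for _ in tokens]   # placeholder activations
direction = [1.0, 0.0]                  # hypothetical compliance direction
alpha = 2.0                             # hypothetical steering strength

# Boundary injection: only post-delimiter positions are modified.
post_mask = boundary_mask(tokens, "</think>")
boundary_steered = steer(hidden, direction, alpha, post_mask)

# Uniform steering: every position is modified, including the reasoning span.
uniform_steered = steer(hidden, direction, alpha, [True] * len(tokens))
```

Under this sketch, the reasoning positions in `boundary_steered` are identical to the unsteered activations, so any refusal conclusion formed before the delimiter is unaffected by the intervention, whereas `uniform_steered` perturbs the reasoning itself.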
Iftekhar Ucchash Ahmed studied this question.