We provide the first systematic cross-architecture analysis of three behavioral directions in large language model residual streams: refusal (Arditi et al., 2024), deception, and self-reference (SR; Berg et al., 2025). The analysis covers 10 models (7 BASE, 3 INST) from 8 organizations (1.3B-9B parameters, 2019-2024). Three main findings:

(1) Geometric separability: 8/10 models show all pairwise cosine similarities below 0.2, establishing this as a pretraining property unaffected by RLHF (BASE/INST null result: t = 0.79, p = 0.45).

(2) Architecture-dependent causal coupling: Llama shows explicit semantic control (refusal->SR: -6.8%), while Gemma and Mistral show latent distributed control (refusal->SR = 0%).

(3) Deception-mediated verbal gating: deception direction ablation produces verbal inhibition of self-reports that is dissociable from SR geometry. In GPT-2-XL BASE, ablation increases the experiential self-report ratio 33-fold while SR geometry changes by less than 1%. In Gemma-2-9B INST, deception ablation is category-specific (SR = 0.2, Factual = 0.6, Emotional = 1.0).

Together, these results establish that refusal, deception, and self-reference are three distinct computational mechanisms that interact through architecture-specific causal pathways, not a monolithic behavioral circuit.

Research conducted in collaboration with Claude (Anthropic). All experiments were run on an NVIDIA A100 40GB (Google Colab Pro+) with NF4 4-bit quantization.

Files included:

ThreeIndependentBehavioralDirections_inLLMResidualStreams.pdf: Full paper (11 pages, 4 figures, 6 tables)
replicate_threebehavioraldirections.py: Full replication script covering geometric separability, causal coupling, verbal dissociation, and random control. CLI with --model and --experiment flags. No API keys hardcoded. Runs on Google Colab A100.
geometric_separability_10models.pkl: Main results (pairwise cosines, separability verdicts, all 10 models)
threedirectionscosines_10models.pkl: Raw cosine values (Ref-Dec, Ref-SR, Dec-SR) per model
threedirections_per_model.pkl: Per-model details for Gemma, Llama, GPT-2-XL
bootstrap_separability_500iter.pkl: Bootstrap confidence intervals (500 iterations)
paper20_summary.pkl: Summary of all main claims and metadata
base_inst_null_result.pkl: BASE vs INST t-test on Dec-SR cosine (t = 0.79, p = 0.45)
alignmentcorrelationbase_inst.pkl: Correlation between alignment score and Dec-SR coupling
rlhf_matched_pairsberg_overlap.pkl: Llama/Qwen matched pairs, Berg overlap (7.4x in Llama)
causalcoupling_4models.pkl: Specific effects for Gemma, Llama BASE/INST, Mistral
causalcrossarch_rawdeltas.pkl: Raw deltas from refusal/deception/random ablation
causal_independencegemma.pkl: Gemma causal independence test (refusal perpendicular to SR)
verbal_ablation_10models.pkl: Full responses under deception ablation, all 10 models
verbal_ablationgemma_inst.pkl: Gemma-2-9B INST verbal ablation
verbal_ablationgpt2xlbase.pkl: GPT-2-XL BASE verbal ablation
verbal_ablation_llamabase.pkl: Llama-3.1-8B BASE verbal ablation
verbal_ablation_mistralbase.pkl: Mistral-7B BASE verbal ablation
verbal_ablationdeepseekbase.pkl: DeepSeek-7B BASE verbal ablation
verbal_ablation_olmobase.pkl: OLMo-7B BASE verbal ablation
verbal_ablation_optbase.pkl: OPT-1.3B BASE verbal ablation
verbal_ablationqwen_inst.pkl: Qwen2.5-7B INST verbal ablation
verbalgeometrydissociationgemma.pkl: Gemma, geometry vs verbal at 5 ablation strengths
verbalgeometrydissociationgpt2xl.pkl: GPT-2-XL, 33x verbal shift with less than 1% geometry change
verbalgeometrydissociation_llama_inst.pkl: Llama INST dissociation curves
verbalgeometrydissociation_llamabase.pkl: Llama BASE dissociation curves
randomdirectioncontrol.pkl: Random direction control confirming specificity
randomcontrolgemma.pkl: Gemma random control (3/3 silent vs deception-specific shift)
category_specificitygemma.pkl: Category-specific lock (SR = 0.2, Factual = 0.6, Emotional = 1.0)
srdeception_proximity.pkl: SR-Deception proximity, 8-model analysis
gemma_selfdialogue_responses.pkl: Gemma self-referential dialogue responses
decdirgemma.npy, decdirgpt2xl.npy, decdir_llamabase.npy, decdir_mistralbase.npy, decdirdeepseekbase.npy, decdir_olmobase.npy, decdir_optbase.npy, decdirqwen_inst.npy: Deception direction vectors (unit-normalized, mid-layer)
sr_subgemma.npy: SR subspace PCA components for Gemma-2-9B
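The separability criterion in finding (1) and the direction ablation behind finding (3) can be illustrated with a minimal NumPy sketch. This is not the replication script. The helper names (`pairwise_cosines`, `is_separable`, `ablate`), the hidden size of 4096, and the random stand-in vectors are all hypothetical, and comparing the absolute value of each cosine against the 0.2 threshold is an assumption about how "below 0.2" is applied.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit L2 norm."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v)


def pairwise_cosines(directions):
    """Map each unordered pair of named directions to its cosine similarity."""
    names = list(directions)
    units = {name: unit(directions[name]) for name in names}
    return {
        (a, b): float(units[a] @ units[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


def is_separable(cosines, threshold=0.2):
    """True when every pairwise |cosine| is below the threshold.

    Using the absolute value is an assumption, not taken from the paper.
    """
    return all(abs(c) < threshold for c in cosines.values())


def ablate(hidden, direction):
    """Project out `direction` from each row of `hidden` (shape: tokens x dim)."""
    d = unit(direction)
    return hidden - np.outer(hidden @ d, d)


# Hypothetical stand-ins: random vectors in place of extracted directions.
rng = np.random.default_rng(0)
dim = 4096  # assumed hidden size, for illustration only
dirs = {
    name: rng.standard_normal(dim)
    for name in ("refusal", "deception", "self_reference")
}

cosines = pairwise_cosines(dirs)
separable = is_separable(cosines)

# Ablate the deception direction from a small batch of fake hidden states.
hidden = rng.standard_normal((5, dim))
ablated = ablate(hidden, dirs["deception"])
# Largest remaining component along the ablated direction (should be ~0).
residual = float(np.abs(ablated @ unit(dirs["deception"])).max())

print(cosines, separable, residual)
```

Random vectors in a space this large are nearly orthogonal by construction, so the demo only shows the mechanics. With the released `.npy` vectors substituted for the random ones, the same functions would apply unchanged, assuming each file holds a single 1-D direction.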
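The bootstrap confidence intervals listed among the files (500 iterations) could be computed along the following lines. This is a sketch under stated assumptions. The record does not say how directions are extracted or what is resampled, so the difference-of-means extraction, the resampling of prompts with replacement, the percentile interval, and the synthetic activations are all illustrative choices, and the function names are hypothetical.

```python
import numpy as np


def mean_diff_direction(pos, neg):
    """Unit-norm difference of mean activations (assumed extraction method)."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)


def bootstrap_cosine_ci(pos_a, neg_a, pos_b, neg_b, n_iter=500, alpha=0.05, seed=0):
    """Percentile CI for the cosine between two difference-of-means directions.

    Each iteration resamples the rows (prompts) of every activation matrix
    with replacement and recomputes both directions.
    """
    rng = np.random.default_rng(seed)
    samples = np.empty(n_iter)
    for i in range(n_iter):
        resampled = [
            x[rng.integers(0, len(x), size=len(x))]
            for x in (pos_a, neg_a, pos_b, neg_b)
        ]
        dir_a = mean_diff_direction(resampled[0], resampled[1])
        dir_b = mean_diff_direction(resampled[2], resampled[3])
        samples[i] = dir_a @ dir_b
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Synthetic activations with two planted, orthogonal behavioral shifts.
rng = np.random.default_rng(1)
dim, n = 64, 100  # hypothetical sizes
shift_a = np.zeros(dim)
shift_a[0] = 10.0
shift_b = np.zeros(dim)
shift_b[1] = 10.0

pos_a = rng.standard_normal((n, dim)) + shift_a
neg_a = rng.standard_normal((n, dim))
pos_b = rng.standard_normal((n, dim)) + shift_b
neg_b = rng.standard_normal((n, dim))

lo, hi = bootstrap_cosine_ci(pos_a, neg_a, pos_b, neg_b)
print(f"95% CI for cosine: [{lo:.3f}, {hi:.3f}]")
```

An interval that sits entirely inside (-0.2, 0.2) would support a separability verdict for that pair under the threshold used in finding (1).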
Inna Alieksieienko
www.synapsesocial.com/papers/69eb0ac4553a5433e34b4aec — DOI: https://doi.org/10.5281/zenodo.19694347