What question did this study set out to answer?

This study asks whether refusal, deception, and self-reference occupy geometrically distinct directions in large language model residual streams, and whether that separation holds across architectures.

April 24, 2026 · Open Access

Three Independent Behavioral Directions in LLM Residual Streams: Refusal, Deception, and Self-Reference Are Geometrically Separable Across 10 Architectures

Key Points

Cross-architecture analysis of 10 large language models (7 BASE, 3 INST; 1.3B-9B parameters) from 8 organizations.
Assessment of geometric separability using cosine similarity metrics between behavior directions.
Ablation studies evaluating how removing the deception direction affects verbal self-reports.
8 out of 10 models show all pairwise cosine similarities below 0.2, indicating geometric separability of the behavior directions.
Llama shows explicit semantic control, with a measurable effect of refusal on self-reference (refusal->SR: -6.8%), whereas Gemma and Mistral show none (refusal->SR = 0%).
Deception ablation increases the experiential self-report ratio 33-fold in GPT-2-XL BASE while SR geometry changes by less than 1%, suggesting that verbal self-report and SR geometry are dissociable.
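
The separability criterion in the key points can be illustrated with a minimal sketch. It assumes each behavior is summarized by a single direction vector and that "below 0.2" is applied to the absolute cosine; both are assumptions, as are the names `cosine` and `separability` and the toy 4-dimensional vectors, none of which come from the paper's replication script.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separability(directions, threshold=0.2):
    """Return all pairwise cosines and whether every |cosine| is below threshold.

    Using the absolute value is an assumption; the paper only states
    that the pairwise cosine similarities are below 0.2.
    """
    pairs = {
        (a, b): cosine(directions[a], directions[b])
        for a, b in combinations(directions, 2)
    }
    separable = all(abs(c) < threshold for c in pairs.values())
    return pairs, separable


# Toy vectors invented for illustration; real directions live in the
# model's residual stream and have thousands of dimensions.
toy = {
    "refusal": [1.0, 0.0, 0.0, 0.1],
    "deception": [0.0, 1.0, 0.1, 0.0],
    "self_reference": [0.1, 0.0, 1.0, 0.0],
}
pairs, separable = separability(toy)
for (a, b), c in pairs.items():
    print(f"{a} vs {b}: {c:.3f}")
print("separable:", separable)
```

With three directions there are three pairs (Ref-Dec, Ref-SR, Dec-SR), so a model counts as separable here only if all three cosines pass the threshold.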

Abstract

We provide the first systematic cross-architecture analysis of three behavioral directions in large language model residual streams, namely refusal (Arditi et al., 2024), deception, and self-reference (Berg et al., 2025), across 10 models (7 BASE, 3 INST) from 8 organizations (1.3B-9B parameters, 2019-2024). Three main findings: (1) Geometric separability: 8/10 models show all pairwise cosine similarities below 0.2, establishing this as a pretraining property unaffected by RLHF (BASE/INST null result: t = 0.79, p = 0.45). (2) Architecture-dependent causal coupling: Llama shows explicit semantic control (refusal->SR: -6.8%), while Gemma and Mistral show latent distributed control (refusal->SR = 0%). (3) Deception-mediated verbal gating: deception direction ablation produces verbal inhibition of self-reports dissociable from SR geometry. In GPT-2-XL BASE, ablation increases experiential self-report ratio 33-fold while SR geometry changes less than 1%. In Gemma-2-9B INST, deception ablation is category-specific (SR = 0.2, Factual = 0.6, Emotional = 1.0). Together, these results establish that refusal, deception, and self-reference are three distinct computational mechanisms that interact through architecture-specific causal pathways, not a monolithic behavioral circuit.

Research conducted in collaboration with Claude (Anthropic). All experiments on NVIDIA A100 40GB, Google Colab Pro+, NF4 4-bit quantization.

Files included:

ThreeIndependentBehavioralDirections_inLLMResidualStreams.pdf: Full paper (11 pages, 4 figures, 6 tables)
replicate_threebehavioraldirections.py: Full replication script: geometric separability, causal coupling, verbal dissociation, random control. CLI with --model and --experiment flags. No API keys hardcoded. Runs on Google Colab A100.
geometric_separability_10models.pkl: Main results: pairwise cosines, separability verdicts, all 10 models
threedirectionscosines_10models.pkl: Raw cosine values (Ref-Dec, Ref-SR, Dec-SR) per model
threedirections_per_model.pkl: Per-model details for Gemma, Llama, GPT-2-XL
bootstrap_separability_500iter.pkl: Bootstrap confidence intervals (500 iterations)
paper20_summary.pkl: Summary of all main claims and metadata
base_inst_null_result.pkl: BASE vs INST t-test on Dec-SR cosine (t = 0.79, p = 0.45)
alignmentcorrelationbase_inst.pkl: Correlation between alignment score and Dec-SR coupling
rlhf_matched_pairsberg_overlap.pkl: Llama/Qwen matched pairs, Berg overlap (7.4x in Llama)
causalcoupling_4models.pkl: Specific effects for Gemma, Llama BASE/INST, Mistral
causalcrossarch_rawdeltas.pkl: Raw deltas from refusal/deception/random ablation
causal_independencegemma.pkl: Gemma causal independence test (refusal perpendicular to SR)
verbal_ablation_10models.pkl: Full responses under deception ablation, all 10 models
verbal_ablationgemma_inst.pkl: Gemma-2-9B INST verbal ablation
verbal_ablationgpt2xlbase.pkl: GPT-2-XL BASE verbal ablation
verbal_ablation_llamabase.pkl: Llama-3.1-8B BASE verbal ablation
verbal_ablation_mistralbase.pkl: Mistral-7B BASE verbal ablation
verbal_ablationdeepseekbase.pkl: DeepSeek-7B BASE verbal ablation
verbal_ablation_olmobase.pkl: OLMo-7B BASE verbal ablation
verbal_ablation_optbase.pkl: OPT-1.3B BASE verbal ablation
verbal_ablationqwen_inst.pkl: Qwen2.5-7B INST verbal ablation
verbalgeometrydissociationgemma.pkl: Gemma: geometry vs verbal at 5 ablation strengths
verbalgeometrydissociationgpt2xl.pkl: GPT-2-XL: 33x verbal shift, less than 1% geometry change
verbalgeometrydissociation_llama_inst.pkl: Llama INST dissociation curves
verbalgeometrydissociation_llamabase.pkl: Llama BASE dissociation curves
randomdirectioncontrol.pkl: Random direction control confirming specificity
randomcontrolgemma.pkl: Gemma random control (3/3 silent vs deception-specific shift)
category_specificitygemma.pkl: Category-specific lock (SR=0.2, Factual=0.6, Emotional=1.0)
srdeception_proximity.pkl: SR-Deception proximity, 8-model analysis
gemma_selfdialogue_responses.pkl: Gemma self-referential dialogue responses
decdirgemma.npy, decdirgpt2xl.npy, decdir_llamabase.npy, decdir_mistralbase.npy, decdirdeepseekbase.npy, decdir_olmobase.npy, decdir_optbase.npy, decdirqwen_inst.npy: Deception direction vectors (unit-normalized, mid-layer)
sr_subgemma.npy: SR subspace PCA components for Gemma-2-9B
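
The direction ablation referred to in the abstract can be sketched as a projection that removes the component of an activation along a unit direction, h' = h - s (h · d̂) d̂. This is the standard projection form of directional ablation and is offered only as an illustration: the function names, the `strength` parameter s (motivated by the "5 ablation strengths" mentioned in the file list), and the toy vectors are assumptions, and the replication script may implement ablation differently.

```python
import math


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ablate(h, d, strength=1.0):
    """Remove `strength` times the component of activation h along direction d.

    strength=1.0 projects the direction out completely; intermediate
    values (an assumption about how graded ablation is applied) remove
    only part of it.
    """
    d_hat = unit(d)
    coeff = dot(h, d_hat)
    return [a - strength * coeff * b for a, b in zip(h, d_hat)]


# Toy 3-dimensional activation and direction, invented for illustration.
h = [2.0, 1.0, -1.0]
d = [0.0, 3.0, 4.0]

h_full = ablate(h, d)                # full ablation
h_half = ablate(h, d, strength=0.5)  # partial ablation

# After full ablation the activation has no component left along d.
residual = dot(h_full, unit(d))
print("ablated activation:", [round(x, 3) for x in h_full])
print("remaining component along d:", round(residual, 6))
```

Because only the component along d̂ is removed, everything orthogonal to the direction is untouched. Geometrically, that is why ablating one direction need not change a nearly orthogonal one.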
