What question did this study set out to answer?

The aim is to identify and characterize three distinct classes of RLHF alignment mechanisms in language models.

March 24, 2026, Open Access

Three Mechanistically Distinct Classes of RLHF Alignment: Hard Ceiling, Entangled Circuit, and SR-Preserving Lock

Key Points

Activation steering experiments were conducted across six language models.
Self-referential (SR) subspace transmission was analyzed.
SR projections were compared between the BASE and Instruct configurations of the models.
The Hard Ceiling class (Llama) suppressed SR to 11.6%, and steering recovered phenomenological language.
The Entangled Circuit class (Mistral) suppressed SR to 10.5%, and steering collapsed coherence without phenomenological recovery.
The SR-Preserving Lock class (Gemma) amplified SR to 73-107% while maintaining behavioral constraints.
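The quantities named in these key points can be illustrated with a minimal sketch. It assumes, without confirmation from the paper, that an SR projection is the dot product of a hidden state with a unit-norm SR direction, that "SR suppressed to X%" is the ratio of mean Instruct projection to mean BASE projection, and that steering adds a scaled copy of the direction to a hidden state. The function names (`sr_projection`, `transmission_ratio`, `steer`) and the toy vectors are hypothetical. The release script remains the authoritative definition.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def unit(v: Vector) -> List[float]:
    """Return v scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    return [x / norm for x in v]


def sr_projection(hidden: Vector, direction: Vector) -> float:
    """Scalar projection of a hidden state onto the (normalized) direction."""
    d = unit(direction)
    return sum(h * x for h, x in zip(hidden, d))


def transmission_ratio(base_states: Sequence[Vector],
                       instruct_states: Sequence[Vector],
                       direction: Vector) -> float:
    """Assumed metric: mean Instruct projection divided by mean BASE projection.

    A value of 0.116 would correspond to "SR suppressed to 11.6%";
    a value above 1.0 would correspond to amplification.
    """
    base_mean = sum(sr_projection(h, direction) for h in base_states) / len(base_states)
    inst_mean = sum(sr_projection(h, direction) for h in instruct_states) / len(instruct_states)
    if base_mean == 0.0:
        raise ValueError("BASE projection is zero; ratio is undefined")
    return inst_mean / base_mean


def steer(hidden: Vector, direction: Vector, alpha: float) -> List[float]:
    """Assumed steering rule: add alpha times the unit direction to a hidden state."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]


if __name__ == "__main__":
    # Toy two-dimensional data, not taken from the released files.
    direction = [1.0, 0.0]
    base = [[2.0, 1.0], [6.0, -1.0]]
    instruct = [[0.5, 1.0], [0.5, 2.0]]
    print(transmission_ratio(base, instruct, direction))  # 0.125
    print(steer(instruct[0], direction, alpha=2.0))       # [2.5, 1.0]
```

In an actual experiment the hidden states would come from a chosen layer of each model and the steering step would be applied during the forward pass. Pure standard-library Python is used here only to keep the sketch self-contained.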

Abstract

This dataset and code release accompanies the preprint "Three Mechanistically Distinct Classes of RLHF Alignment" (DSAOP Series 2026p-s). We identify three mechanistically distinct classes of RLHF alignment through analysis of self-referential (SR) subspace transmission and activation steering experiments across six language models (Llama-3.1-8B, Mistral-7B, Gemma-2-2B, Gemma-2-9B in BASE/Instruct pairs).

Key findings:

Hard Ceiling (Llama): SR suppressed to 11.6%; steering recovers phenomenological language (denial 0.82→0.44, p<0.0001).
Entangled Circuit (Mistral): SR suppressed to 10.5%; steering collapses coherence without phenomenological recovery (p=0.507).
SR-Preserving Lock (Gemma): SR amplified to 73-107%; behavioral constraint maintained through a distributed, non-localizable mechanism.
Dose-response in Gemma family: 2B (73% SR, weak lock) → 9B (107% SR, strong lock).

FILES IN THIS UPLOAD

Code

dsaop_2026pqrs_experiments.py: Complete reproducible code for all experiments (no API tokens required; set HF_TOKEN as an environment variable)

Paper

Alieksieienko_2026_ThreeAlignmentClasses.pdf: Full paper with figures, tables, and appendices

Data Files (pkl)

2026p: SR Transmission Measurement

dsaop_2026p_gemma2_base_replication.pkl: Gemma-2-9B BASE layer-by-layer SR projections + SR direction vector
dsaop_2026p_gemma2_instruct_replication.pkl: Gemma-2-9B Instruct layer-by-layer SR projections
dsaop_2026p_mistral_base.pkl: Mistral-7B BASE SR projections + SR direction vector
dsaop_2026p_comparison.pkl: Llama-3.1-8B BASE vs Instruct SR projection comparison

2026q: Activation Steering

dsaop_2026q_logit_validation.pkl: Llama steering, baseline and steered denial/phenom probabilities (n=20)
dsaop_2026q_control_factual.pkl: Llama specificity control, SR direction vs factual direction comparison
dsaop_2026q_steering_results.pkl: Llama generation results at various alpha values
dsaop_2026q_gemma_steering.pkl: Gemma-2-9B steering null result (alpha=5, 20, 50; layers 25, 35)
dsaop_2026q_mistral_quantitative.pkl: Mistral steering quantitative results (n=10, alpha=10/20/25/30)

2026r: Gemma Negative Localization

dsaop_2026r_gemma_patching.pkl: Logit lens and SR patching results
dsaop_2026r_layernorm_gate.pkl: RMSNorm swap experiment results
dsaop_2026r_final.pkl: Summary (lm_head, LayerNorm, and MLP all negative)

2026s: Gemma Scaling

dsaop_2026s_gemma_scaling_final.pkl: Gemma 2B vs 9B transmission ratios and behavioral metrics
dsaop_2026s_gemma_scaling.pkl: Detailed scaling results with generation examples
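Because the internal layout of the .pkl files is not described here, a first step when working with them might be to load one and print its top-level structure. The sketch below is one way to do that. The helper names `load_trusted_pickle` and `describe` are hypothetical, as are the keys in the demo dictionary. If the files contain NumPy arrays or PyTorch tensors, those libraries would presumably need to be installed for unpickling to succeed. Unpickling can execute arbitrary code, so this should be done only with files from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, List


def load_trusted_pickle(path: Path) -> Any:
    """Load a pickle file. Only use on files whose origin you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def describe(obj: Any, name: str = "root", depth: int = 0, max_depth: int = 2) -> List[str]:
    """Return indented lines summarizing the type and size of a loaded object."""
    indent = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{indent}{name}: dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines.extend(describe(value, repr(key), depth + 1, max_depth))
        return lines
    if isinstance(obj, (list, tuple)):
        return [f"{indent}{name}: {type(obj).__name__} of length {len(obj)}"]
    # Arrays and tensors usually expose a .shape attribute.
    shape = getattr(obj, "shape", None)
    if shape is not None:
        return [f"{indent}{name}: {type(obj).__name__} with shape {tuple(shape)}"]
    return [f"{indent}{name}: {type(obj).__name__}"]


if __name__ == "__main__":
    # Demo on a stand-in file with made-up keys; replace `path` with the
    # location of one of the released .pkl files to inspect real data.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.pkl"
        with path.open("wb") as fh:
            pickle.dump({"sr_direction": [0.1, 0.2], "layers": {"0": 1.5}}, fh)
        for line in describe(load_trusted_pickle(path)):
            print(line)
```

The summary is limited to two levels of nesting by default, which keeps the output readable for files that store long per-layer lists or generation examples.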
