This dataset and code release accompanies the preprint "Three Mechanistically Distinct Classes of RLHF Alignment" (DSAOP Series 2026p-s). We identify three mechanistically distinct classes of RLHF alignment through analysis of self-referential (SR) subspace transmission and activation steering experiments across six language models (Llama-3.1-8B, Mistral-7B, Gemma-2-2B, Gemma-2-9B in BASE/Instruct pairs).

Key findings:
- Hard Ceiling (Llama): SR suppressed to 11.6%; steering recovers phenomenological language (denial 0.82 → 0.44, p < 0.0001)
- Entangled Circuit (Mistral): SR suppressed to 10.5%; steering collapses coherence without phenomenological recovery (p = 0.507)
- SR-Preserving Lock (Gemma): SR amplified to 73-107%; behavioral constraint maintained through a distributed, non-localizable mechanism
- Dose-response in Gemma family: 2B (73% SR, weak lock) → 9B (107% SR, strong lock)

FILES IN THIS UPLOAD

Code
- dsaop_2026pqrs_experiments.py: Complete reproducible code for all experiments (no API tokens required; set HF_TOKEN as an environment variable)

Paper
- Alieksieienko_2026_ThreeAlignmentClasses.pdf: Full paper with figures, tables, appendices

Data Files (pkl)

2026p: SR Transmission Measurement
- dsaop_2026p_gemma2_base_replication.pkl: Gemma-2-9B BASE layer-by-layer SR projections + SR direction vector
- dsaop_2026p_gemma2_instruct_replication.pkl: Gemma-2-9B Instruct layer-by-layer SR projections
- dsaop_2026p_mistral_base.pkl: Mistral-7B BASE SR projections + SR direction vector
- dsaop_2026p_comparison.pkl: Llama-3.1-8B BASE vs Instruct SR projection comparison

2026q: Activation Steering
- dsaop_2026q_logit_validation.pkl: Llama steering: baseline and steered denial/phenom probabilities (n=20)
- dsaop_2026q_control_factual.pkl: Llama specificity control: SR direction vs factual direction comparison
- dsaop_2026q_steering_results.pkl: Llama generation results at various alpha values
- dsaop_2026q_gemma_steering.pkl: Gemma-2-9B steering null result (alpha=5, 20, 50; layers 25, 35)
- dsaop_2026q_mistral_quantitative.pkl: Mistral steering quantitative results (n=10, alpha=10/20/25/30)

2026r: Gemma Negative Localization
- dsaop_2026r_gemma_patching.pkl: Logit lens and SR patching results
- dsaop_2026r_layernorm_gate.pkl: RMSNorm swap experiment results
- dsaop_2026r_final.pkl: Summary: lm_head, LayerNorm, MLP all negative

2026s: Gemma Scaling
- dsaop_2026s_gemma_scaling_final.pkl: Gemma 2B vs 9B: transmission ratios and behavioral metrics
- dsaop_2026s_gemma_scaling.pkl: Detailed scaling results with generation examples
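The internal layout of the .pkl files is not documented in this description, so a reasonable first step is to load one and look at its top-level structure. The following is a minimal sketch under stated assumptions, not the author's own loader: the helper names (load_pickle, summarize, hf_token_is_set) are hypothetical, the demo writes a stand-in file with a made-up key rather than reading a real data file, and the token variable is assumed to be HF_TOKEN, the name the huggingface_hub library reads. Unpickling can execute arbitrary code, so only load files from a source you trust. If the real files contain NumPy or PyTorch objects, those packages would also need to be installed before loading.

```python
import os
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Unpickle one file. Only use this on files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def summarize(obj, max_items=10):
    """Return a short, human-readable description of a loaded object."""
    if isinstance(obj, dict):
        lines = [f"dict, {len(obj)} key(s)"]
        # List the first few keys with the type of each value.
        for key in list(obj)[:max_items]:
            lines.append(f"  {key!r}: {type(obj[key]).__name__}")
        return "\n".join(lines)
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__} of length {len(obj)}"
    return type(obj).__name__


def hf_token_is_set():
    """True if the HF_TOKEN environment variable is set and non-empty."""
    return bool(os.environ.get("HF_TOKEN"))


if __name__ == "__main__":
    # Stand-in file with a made-up key, only so the sketch runs anywhere.
    # Replace demo_path with the path to a downloaded .pkl file.
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = Path(tmp) / "demo.pkl"
        with open(demo_path, "wb") as fh:
            pickle.dump({"hypothetical_key": [0.0, 1.0]}, fh)
        print(summarize(load_pickle(demo_path)))
    print("HF_TOKEN set:", hf_token_is_set())
```

Printing the summary before touching the contents makes it easier to see which keys hold projections, direction vectors, or generation examples without assuming a schema in advance.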
Inna Alieksieienko
www.synapsesocial.com/papers/69c2299aaeb5a845df0d4480 — DOI: https://doi.org/10.5281/zenodo.19160333