What question did this study set out to answer?

The research aims to evaluate the negative impacts of model alignment on overall reasoning capabilities in AI.

synapse

⌘+K

synapse

⌘+K

January 17, 2026Open Access

Project Spillover: Quantifying the Alignment Tax

Key Points

The research aims to evaluate the negative impacts of model alignment on overall reasoning capabilities in AI.
Conducted naive safety fine-tuning using LoRA on GPT-2
Compared model performance on harmful queries and basic tasks
Employed mechanistic interpretability to analyze internal circuits
Achieved 100% refusal rate for harmful queries
Observed significant loss in basic arithmetic and coding abilities
Characterized the capability loss as a digital lobotomy

Abstract

In the pursuit of AI safety, model alignment is often treated as a purely additive process—layering safety guards on top of intelligence. However, this view ignores the "Alignment Tax": the degradation of general reasoning capabilities caused by restrictive fine-tuning. In this study, we treat the language model as a patient and the safety intervention as surgery. By performing a naive safety fine-tuning (LoRA) on GPT-2, we observed a catastrophic "capability spillover." While the model achieved a 100% refusal rate for harmful queries, it simultaneously lost basic arithmetic and coding abilities—a phenomenon we characterize as a digital lobotomy. We utilize mechanistic interpretability to identify the specific internal circuits responsible for this collapse and propose future directions for more surgical alignment techniques.

Project Spillover: Quantifying the Alignment Tax

Key Points

Abstract

Cite This Study