In the pursuit of AI safety, model alignment is often treated as a purely additive process—layering safety guards on top of intelligence. However, this view ignores the "Alignment Tax": the degradation of general reasoning capabilities caused by restrictive fine-tuning. In this study, we treat the language model as a patient and the safety intervention as surgery. By performing a naive safety fine-tuning (LoRA) on GPT-2, we observed a catastrophic "capability spillover." While the model achieved a 100% refusal rate for harmful queries, it simultaneously lost basic arithmetic and coding abilities—a phenomenon we characterize as a digital lobotomy. We utilize mechanistic interpretability to identify the specific internal circuits responsible for this collapse and propose future directions for more surgical alignment techniques.
Aditya Raj (Thu,) studied this question.