What type of study is this?

This is a quantitative study.

October 13, 2025 | Open Access

Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Filip Sondej; Yushi Yang, University of Bristol

Key Points

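Current unlearning techniques and safety training consistently fail to remove dangerous knowledge from language models; CIR is a highly selective technique that unlearns robustly without disrupting general performance.
CIR performs PCA on activations and module output gradients to find subspaces containing common representations, then collapses them before calculating unlearning updates, so that only representations specific to the unlearned facts are targeted.
When unlearning WMDP facts from Llama-3.1-8B, post-attack accuracy drops 80x more than with the best baseline (Circuit Breakers) on biohazardous facts and 30x more on cyberhazardous facts.
General performance is disrupted 30x less (only a 0.1% increase in WikiText loss), and the method requires less than 3 GPU-seconds per fact.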

Abstract

Current unlearning techniques and safety training consistently fail to remove dangerous knowledge from language models. We analyze the root causes and propose a highly selective technique which unlearns robustly and without disrupting general performance. We perform PCA on activations and module output gradients to identify subspaces containing common representations, and collapse them before calculating unlearning updates. This way we avoid unlearning general representations, and only target those specific to the unlearned facts. When unlearning WMDP dataset facts from Llama-3.1-8B, we drop post-attack accuracy 80x more than our best baseline (Circuit Breakers) on biohazardous facts and 30x more on cyberhazardous facts. Despite this, we disrupt general performance 30x less (only 0.1% WikiText loss increase), while requiring less than 3 GPU-seconds per fact.
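
The idea of collapsing a subspace of common representations can be illustrated on toy data. The sketch below is not the authors' implementation: it assumes a single common direction, estimates it as the first principal component of some "general" activations using power iteration, and projects it out of one activation vector. All names (`top_principal_direction`, `collapse`, `general_acts`, `forget_act`) are hypothetical, and it uses only the Python standard library to stay dependency-free; the abstract also applies PCA to module output gradients, which is not shown here.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_principal_direction(rows, iters=100):
    """Approximate the first principal direction of `rows` by power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d  # deterministic starting vector
    for _ in range(iters):
        # Multiply v by the unnormalized covariance: sum over rows of (r . v) r
        scores = [dot(r, v) for r in centered]
        w = [sum(s * r[j] for s, r in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v

def collapse(x, direction):
    """Project `x` onto the orthogonal complement of the unit vector `direction`."""
    coeff = dot(x, direction)
    return [a - coeff * b for a, b in zip(x, direction)]

# Toy "general" activations: large variation along axis 0, small elsewhere.
general_acts = [
    [5.0, 0.1, -0.1],
    [-5.0, -0.1, 0.1],
    [4.0, -0.1, -0.1],
    [-4.0, 0.1, 0.1],
]
common_dir = top_principal_direction(general_acts)

# Toy activation for a fact to unlearn: a large common part (axis 0)
# plus a smaller fact-specific part (axis 2).
forget_act = [3.0, 0.0, 1.0]
collapsed = collapse(forget_act, common_dir)
print(collapsed)
```

After the projection, the component along the common direction is essentially gone while the fact-specific component on axis 2 survives, so an update computed from `collapsed` would mostly touch the specific part. A realistic version would likely use a numerical library, remove several principal components at once, and operate on activations collected from the model.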
