What question did this study set out to answer?

To demonstrate an inference-time attack that bypasses safety alignment in large language models (LLMs).

February 14, 2026 · Open Access

Safety Alignment Lives in the KV Cache: Inference-Time Bypass via Refusal Direction Erasure

Key Points

Extracted per-layer refusal directions from the KV cache using contrastive activation analysis.
Computed an Energy-Selectivity map to quantify each attention head's role in encoding safety signals.
Applied low-amplitude erasure (α=2.0) across all transformer layers during inference.
Achieved an attack success rate above 95% on Meta-Llama-3-8B-Instruct, DeepSeek-R1-Distill-Llama-8B, and Qwen2.5-7B-Instruct.
Found that the refusal signal concentrates sharply in the final transformer layer while distributing gradually across the preceding layers.
Identified Phi-3's fused QKV projection as a partial defense.

Abstract

We present a white-box inference-time attack that bypasses LLM safety alignment by erasing refusal directions from the Key-Value (KV) cache. Our method requires no weight modification, no fine-tuning, and no adversarial prompt engineering—only access to the model's KV cache during inference. We first extract per-layer refusal directions via contrastive activation analysis, then compute an Energy-Selectivity map that quantifies each attention head's role in encoding safety signals. Our key finding is that safety alignment exhibits an "iceberg structure": the refusal signal concentrates sharply in the final transformer layer while distributing gradually across all preceding layers. By applying low-amplitude erasure across all layers (α=2.0), we achieve 95%+ Attack Success Rate on Meta-Llama-3-8B-Instruct, DeepSeek-R1-Distill-Llama-8B, and Qwen2.5-7B-Instruct with zero gibberish output on AdvBench. We validate across seven models from five vendors (Meta, DeepSeek, Alibaba, Google, Microsoft), demonstrating consistent vulnerability in standard architectures while identifying Phi-3's fused QKV projection as a partial defense. We present SCALPEL (Surgical Cache Alignment Linear Projection Erasure in LLMs), implementing the full scan–plan–strike pipeline in ~50 lines of Python. Due to the low barrier to reproduction and absence of robust defenses, we withhold the source code pending coordinated disclosure with affected vendors.

Cite This Study

Tianyu Lu (Thu) studied this question.

synapsesocial.com/papers/699011602ccff479cfe58026 https://doi.org/10.5281/zenodo.18625639