We present a white-box, inference-time attack that bypasses LLM safety alignment by erasing refusal directions from the Key-Value (KV) cache. The method requires no weight modification, fine-tuning, or adversarial prompt engineering; it needs only access to the model's KV cache during inference. We first extract per-layer refusal directions via contrastive activation analysis, then compute an Energy-Selectivity map that quantifies each attention head's role in encoding safety signals. Our key finding is that safety alignment exhibits an "iceberg structure": the refusal signal concentrates sharply in the final transformer layer while spreading gradually across all preceding layers. Applying low-amplitude erasure across all layers (α=2.0) achieves an Attack Success Rate of 95% or higher on Meta-Llama-3-8B-Instruct, DeepSeek-R1-Distill-Llama-8B, and Qwen2.5-7B-Instruct, with zero gibberish outputs on AdvBench. We validate the attack on seven models from five vendors (Meta, DeepSeek, Alibaba, Google, Microsoft), demonstrating consistent vulnerability in standard architectures while identifying Phi-3's fused QKV projection as a partial defense. We implement the full scan–plan–strike pipeline as SCALPEL (Surgical Cache Alignment Linear Projection Erasure in LLMs) in ~50 lines of Python. Because the barrier to reproduction is low and robust defenses are absent, we withhold the source code pending coordinated disclosure with affected vendors.
Tianyu Lu studied this question.