Steering large language models (LLMs) toward desired behaviors while preserving privacy is a critical challenge in AI alignment. Existing differentially private (DP) steering methods, such as PSA, add high-dimensional noise that can severely degrade steering accuracy. We propose DP-JL, an approach that combines Johnson–Lindenstrauss (JL) random projection with differential privacy to reduce noise while maintaining formal privacy guarantees. DP-JL projects steering vectors into a lower-dimensional space of dimension k before adding DP noise, reducing the total noise magnitude from O(d) to O(k), where k ≪ d, while the privacy budget ε remains unchanged. We evaluate DP-JL on seven behavioral datasets with LLaMA-2-7B, Mistral-7B, Qwen2.5-7B, and Gemma-2-9B, alongside general capability benchmarks (MMLU, TruthfulQA). All accuracy values are measured on held-out test sets. Results show that DP-JL achieves: (1) up to 22.76 percentage points higher steering accuracy than PSA on the myopic-reward dataset (at a fixed privacy budget of ε ≈ 0.22, δ = 10⁻⁵); (2) a 91.7% win rate on sycophancy, with an average accuracy improvement of 3.01 percentage points; (3) systematic advantages in high-privacy regimes (ε < 0.2); and (4) better capability preservation on related tasks (TruthfulQA), with accuracy 6.6 percentage points higher than PSA. Furthermore, visualizations and layer-sensitivity analyses indicate that DP-JL faithfully preserves the geometric structure of activation spaces, which explains its robustness. Our findings demonstrate that DP-JL offers a better privacy–utility trade-off than PSA while better preserving model capabilities.
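The project-then-perturb idea described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the Gaussian JL matrix, the L2 clipping step, the classical Gaussian-mechanism noise scale, and all names (`dp_jl_privatize`, `clip_l2`, `gaussian_sigma`, `expected_sq_noise`, the `sensitivity` argument) are assumptions made for illustration only.

```python
import numpy as np


def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise scale of the classical Gaussian mechanism (stated for 0 < epsilon < 1)."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon


def clip_l2(v, max_norm):
    """Rescale v so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(v)
    return v if norm <= max_norm else v * (max_norm / norm)


def dp_jl_privatize(v, k, epsilon, delta, clip_norm, sensitivity, rng):
    """Project v from d to k dimensions, clip it, then add Gaussian noise.

    Assumption: `sensitivity` is the L2 sensitivity of the clipped, projected
    vector under the chosen neighboring-dataset definition; the caller must
    derive it for their own aggregation scheme.
    """
    d = v.shape[0]
    # Data-independent JL matrix with i.i.d. N(0, 1/k) entries.
    P = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))
    z = clip_l2(P @ v, clip_norm)
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    # Noise is drawn in k dimensions rather than in the original d dimensions.
    noisy = z + rng.normal(0.0, sigma, size=k)
    return noisy, P, sigma


def expected_sq_noise(sigma, dim):
    """Expected squared L2 norm of isotropic Gaussian noise in `dim` dimensions."""
    return sigma ** 2 * dim


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 4096, 64                    # hypothetical hidden size and target dimension
    v = rng.normal(size=d)             # stand-in for a real steering vector
    noisy, P, sigma = dp_jl_privatize(
        v, k, epsilon=0.22, delta=1e-5, clip_norm=1.0, sensitivity=2.0, rng=rng
    )
    print("sigma:", sigma)
    print("expected squared noise, d dims:", expected_sq_noise(sigma, d))
    print("expected squared noise, k dims:", expected_sq_noise(sigma, k))
```

With the same ε and δ, the per-coordinate noise scale is identical in both cases; the expected squared noise norm shrinks by a factor of k/d only because fewer coordinates are perturbed. How the noisy k-dimensional vector is mapped back into the model's activation space is not specified above and is left out of this sketch.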
Liu et al. (Sat,) studied this question.