What question did this study set out to answer?

The aim is to enhance steering accuracy in large language models while ensuring differential privacy.

March 10, 2026 · Open Access

DP-JL: Differentially Private Steering via Johnson–Lindenstrauss Projection for Large Language Models

Key Points

Aims to improve steering accuracy in large language models while guaranteeing differential privacy.
Introduces DP-JL, which combines Johnson–Lindenstrauss projection with differential privacy.
Projects steering vectors into a lower-dimensional space before adding noise.
Evaluated on seven behavioral datasets with four large language models (LLaMA-2-7B, Mistral-7B, Qwen2.5-7B, Gemma-2-9B).
Achieves up to 22.76 percentage points higher steering accuracy than PSA on the myopic-reward dataset.
Reaches a 91.7% win rate on sycophancy tasks, with an average accuracy improvement of 3.01 percentage points.
Shows systematic advantages in high-privacy regimes (ε<0.2) while better preserving general model capabilities.

Abstract

Steering large language models (LLMs) toward desired behaviors while preserving privacy is a critical challenge in AI alignment. Existing differentially private (DP) steering methods, such as PSA, add high-dimensional noise that can severely degrade steering accuracy. We propose DP-JL, a novel approach that combines Johnson–Lindenstrauss (JL) random projection with differential privacy to reduce noise while maintaining formal privacy guarantees. DP-JL projects steering vectors into a lower-dimensional space (dimension k) before adding DP noise, reducing total noise magnitude from O(d) to O(k) where k≪d, while the privacy budget ε remains unchanged. We evaluate DP-JL on seven behavioral datasets with LLaMA-2-7B, Mistral-7B, Qwen2.5-7B, and Gemma-2-9B, alongside general capability benchmarks (MMLU, TruthfulQA). All accuracy values are measured on held-out test sets. Results show that DP-JL achieves: (1) up to 22.76 percentage points higher steering accuracy than PSA on the myopic-reward dataset (at fixed privacy budget ε≈0.22, δ=10⁻⁵); (2) 91.7% win rate on sycophancy with an average accuracy improvement of 3.01 percentage points; (3) systematic advantages in high-privacy regimes (ε<0.2); and (4) superior capability preservation on related tasks (TruthfulQA), achieving 6.6 percentage points better accuracy than PSA. Furthermore, visualizations and layer-sensitivity analyses reveal that DP-JL faithfully preserves the geometric structure of activation spaces, explaining its robustness. Our findings demonstrate that DP-JL offers superior privacy–utility trade-offs while better preserving model capabilities.
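
The project-then-noise idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a mean-difference steering vector, per-example L2 clipping applied after projection (so the sensitivity bound holds exactly in the k-dimensional space), replace-one adjacency, and the classic Gaussian mechanism calibration, which is valid for 0 < ε < 1. The function names, the clipping norm, and the back-projection step are hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise scale of the classic Gaussian mechanism (valid for 0 < epsilon < 1)."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon


def clip_rows(x: np.ndarray, clip_norm: float) -> np.ndarray:
    """Rescale each row so that its L2 norm is at most clip_norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))


def dp_jl_steering_sketch(diffs, k, epsilon, delta, clip_norm=1.0, seed=0):
    """Hypothetical sketch: project to k dims, clip, average, add noise, map back.

    diffs: (n, d) array of per-example activation differences (assumed input).
    Returns the noisy steering vector in the original d dims and the noise scale.
    """
    rng = np.random.default_rng(seed)
    n, d = diffs.shape

    # Data-independent JL matrix with N(0, 1/k) entries.
    proj = rng.standard_normal((k, d)) / np.sqrt(k)

    # Project every example, then clip in the low-dimensional space so that
    # each example contributes at most clip_norm to the sum.
    low = clip_rows(diffs @ proj.T, clip_norm)

    # Replacing one example changes the mean by at most 2 * clip_norm / n.
    sigma = gaussian_sigma(2.0 * clip_norm / n, epsilon, delta)

    # Noise is drawn in k dimensions instead of d dimensions.
    noisy_low = low.mean(axis=0) + rng.normal(0.0, sigma, size=k)

    # Mapping back with proj.T is post-processing and costs no extra privacy.
    return proj.T @ noisy_low, sigma


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    diffs = rng.standard_normal((64, 512))  # synthetic stand-in for activations
    vec, sigma = dp_jl_steering_sketch(diffs, k=32, epsilon=0.22, delta=1e-5)
    print(vec.shape, float(sigma))
```

Because the per-coordinate noise scale depends only on the sensitivity, ε, and δ, drawing the noise in k rather than d coordinates shrinks its expected squared norm from dσ² to kσ² in this sketch. The cost is the distortion introduced by the random projection itself.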
