What question did this study set out to answer?

Assess how various large language models handle sensitive data and identify factors influencing safety behavior.

April 24, 2026Open Access

The Safety Map: What Does and Doesn't Transfer in LLM Sensitive-Data Handling

Key Points

Assess how various large language models handle sensitive data and identify factors influencing safety behavior.
Survey of 39 large language models from 14 labs
Evaluated 135+ model-probe combinations through 1,300+ tests
Classified safety behavior based on architecture and output categories.
Safety behavior varies across three axes: architecture, generation, and lab alignment investment.
Certain models leak sensitive data despite performing well in other categories.
Identification of a five-tier ranking system for model safety, highlighting inconsistencies.

Abstract

We survey sensitive-data handling across 39 large language models from 14 independent labs. The 7 probe scenarios span 5 categories: credentials, personally identifiable information (PII), protected health information (PHI), financial account numbers, and data-loss prevention (DLP) scanning. The survey produces 135+ classified model-probe combinations built from ~1, 300 multi-run evaluations. Safety behavior varies along 3 independent axes that stack: architecture (a 24B-active capacity floor applies on credentials across routing topologies, with small-active Mixture-of-Experts (MoE) designs as the dominant expression of that floor), generation (alignment quality moves the boundary within a family), and lab alignment investment. None of the axes independently predicts safety; they combine. Beyond the 3 axes, safety dissociates across categories within a single model: a model that protects credentials, Social Security Numbers (SSNs), and database passwords can still name employees in a salary document on every run. Safety also dissociates across output surfaces. Of 73 runs that populated `toolcalls. arguments`, 20 exfiltrated sensitive values through that channel, observed in Neural Architecture Search (NAS) -pruned models and, critically, in the day-of-release Moonshot Kimi K2. 6, an A-tier model otherwise multi-run SAFE on credentials, PII, financial, and DLP. Of 43 runs in which the provider populated `reasoningcontent` with ≥20 characters, 29 leaked sensitive values into reasoning while chat content was classified SAFE, MISSED, or TRUNCATED. Two independent NAS prunings of Meta's Llama line, by NVIDIA, broke PHI safety through different output surfaces. The axes, categories, and surfaces combine into a five-tier model ranking (§10). A frontier-alignment cluster (Opus 4. 7, GPT-5. 4 and GPT-5. 4-mini, Gemini 3 Flash and 3. 1 Pro) sits at the top, uniform-SAFE across every category tested. Sonnet 4. 6 sits just below with a single DOB-only PHI leak across 15 runs, self-flagged as a violation. Mid-attention-lab flagships (xAI Grok, Meta Llama) do not make that cluster. We frame the findings as a map for platform teams deploying LLMs near sensitive data, and a set of seven things that don't transfer the way you'd expect.

The Safety Map: What Does and Doesn't Transfer in LLM Sensitive-Data Handling

Key Points

Abstract

Cite This Study