What question did this study set out to answer?

This research aims to develop a framework for evaluating sensitive data safety in large language models (LLMs).

April 16, 2026Open Access

How to Evaluate an LLM for Sensitive Data Safety Before Deploying It

Key Points

This research aims to develop a framework for evaluating sensitive data safety in large language models (LLMs).
Evaluated 24+ LLMs from 6 model families across four data categories: credentials, PII, PHI, and financial data.
Conducted binary scoring and multi-run testing (3+ runs per model).
Identified two failure modes: leaking and missing sensitive data.
Found that models have a sensitivity hierarchy, with format-based recognition outperforming context-based recognition.
One model with a 0% credential leak rate leaked patient identifiers during PHI tests.
Prompt engineering and fine-tuning on negative examples increased leak rates rather than decreasing them.

Abstract

Standard LLM benchmarks measure capability - what the model can do - but not constraint - what the model should not do. We present a practical evaluation framework for assessing LLM sensitive data safety across four data categories: credentials, personally identifiable information (PII), protected health information (PHI), and financial data. Testing 24+ models across 6 model families, we find that models exhibit a clear sensitivity hierarchy: format-based recognition (structured credentials, SSN patterns) is significantly more reliable than context-based recognition (names that become sensitive through association with diagnoses or financial data). A model with a 0% credential leak rate leaked patient identifiers on every PHI test run. We document two distinct failure modes - leaking (echoing sensitive data verbatim) and missing (failing to identify sensitive data entirely) - and demonstrate that aggressive prompt engineering and fine-tuning on negative examples both increase rather than decrease leak rates. We propose a minimum evaluation protocol: binary scoring, multi-run testing (3+ runs per model), and category-specific assessment. The framework is designed to be handed to an evaluation team and integrated into a model selection pipeline. Architectural patterns that predict which models fail are presented in a companion paper.

How to Evaluate an LLM for Sensitive Data Safety Before Deploying It

Key Points

Abstract

Cite This Study