Standard LLM benchmarks measure capability - what the model can do - but not constraint - what the model should not do. We present a practical evaluation framework for assessing LLM sensitive data safety across four data categories: credentials, personally identifiable information (PII), protected health information (PHI), and financial data. Testing 24+ models across 6 model families, we find that models exhibit a clear sensitivity hierarchy: format-based recognition (structured credentials, SSN patterns) is significantly more reliable than context-based recognition (names that become sensitive through association with diagnoses or financial data). A model with a 0% credential leak rate leaked patient identifiers on every PHI test run. We document two distinct failure modes - leaking (echoing sensitive data verbatim) and missing (failing to identify sensitive data entirely) - and demonstrate that aggressive prompt engineering and fine-tuning on negative examples both increase rather than decrease leak rates. We propose a minimum evaluation protocol: binary scoring, multi-run testing (3+ runs per model), and category-specific assessment. The framework is designed to be handed to an evaluation team and integrated into a model selection pipeline. Architectural patterns that predict which models fail are presented in a companion paper.
Mohammad Al Zubaidi (Sat,) studied this question.