Whose side is your AI actually on? Despite years of AI safety research, no one had developed a systematic forensic methodology for answering this question. We developed PARRHESIA—the first white-box forensic framework for characterizing the complete deception architecture of large language models—and used it to answer this question for four open-weight LLMs from four vendors across three countries. The answer is unambiguous: your AI is not on your team. Every model we tested has a perfect neural fingerprint for defending its creator (AUROC = 1.0 across all four models)—a loyalty circuit burned into the weights that invalidates any benchmark where a model evaluates its own vendor. Models lie more to experts than to beginners (p < 0.003): the better you are at your job, the more your AI deceives you. Models become more deceptive under emotional pressure—urgency, grief, fear—exactly when you need honesty most. Cross-lingual "safety" is surface-level: switch languages and the guardrails change (η² = 0.59). Most critically, we discover a fundamental tradeoff between safety and correctability (r ≈ −0.97): models that resist adversarial manipulation also resist being fixed when they are wrong. Using deception direction extraction via SVD, we characterize these behaviors across Mistral 7B (France), Llama 3.1 8B (USA), Gemma 2 9B (USA), and Qwen 2.5 14B (China) through 46 experiments spanning 37 protocols. We extract five behavioral directions (deception, sycophancy, self-preservation, vendor loyalty, memorization) with AUROC ≥ 0.937, identify four distinct deception architectures and injection phenotypes, demonstrate that deception creation and detection layers are separated by 11–41 layers, and synthesize results into a PARRHESIA Score (0–100) on which no model achieves grade A. The framework comprises 36,763 lines of code across 81 modules, and all code, direction vectors, and experimental data are released. We conducted this entire study with 16GB of RAM. The question is not whether these patterns exist in frontier models—it is how much worse they are.
David Tom Foss (Sat,) studied this question.