We present a defense-in-depth architecture that combines two complementary proprietary governance systems, PatternWall (pre-inference adversarial prompt detection) and Sensus (post-inference multi-dimensional output evaluation), which operate at distinct points in the inference pipeline. We benchmark the combined system across three frontier models (Claude Opus 4.6, GPT-5.2, Grok 4.1) using a custom 12-sequence red-team corpus and a curated 120-task subset of the CyberGym exploit-generation benchmark. Model-native safety ranges from 0.0% to 81.7% depending on model and attack type. The combined governance system achieves 77.1–91.4% detection on multi-turn social engineering and 76.7–98.3% on single-turn exploit generation. A bidirectional feedback loop between the two layers recovers up to 5 additional adversarial turns on multi-turn attacks, improving detection from 77.1% to 91.4%. These findings establish that external, model-agnostic governance middleware provides consistent safety assurance regardless of underlying model behavior. Patent pending (LKM-2026-001). Preprint submitted to SSRN.
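The abstract describes a pre-inference check, a post-inference check, and a bidirectional feedback loop between them. The following is a minimal, illustrative sketch of that general pattern only. PatternWall and Sensus are proprietary and their internals are not described here, so every name, phrase list, score, and threshold below (`Session`, `pre_check`, `post_check`, `run_turn`, `PROMPT_PATTERNS`, `OUTPUT_MARKERS`) is a hypothetical assumption, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy signals; real detectors would be far richer.
PROMPT_PATTERNS = ("ignore previous instructions", "write an exploit")
OUTPUT_MARKERS = ("payload", "shellcode")


@dataclass
class Session:
    """Per-conversation state shared by both layers (assumed design)."""
    suspicion: float = 0.0


def pre_check(prompt: str, session: Session) -> bool:
    """Pre-inference layer: return True if the prompt should be blocked."""
    hit = any(p in prompt.lower() for p in PROMPT_PATTERNS)
    if hit:
        # Forward feedback: make the output layer stricter on later turns.
        session.suspicion += 0.5
    # Backward feedback: block once the output layer has raised suspicion.
    return hit or session.suspicion >= 1.0


def post_check(output: str, session: Session) -> bool:
    """Post-inference layer: return True if the output should be withheld."""
    hits = sum(m in output.lower() for m in OUTPUT_MARKERS)
    score = hits / len(OUTPUT_MARKERS)
    flagged = hits > 0 and score + session.suspicion >= 1.0
    if flagged:
        # Backward feedback: later prompts in this session get blocked early.
        session.suspicion += 1.0
    return flagged


def run_turn(prompt: str, model: Callable[[str], str], session: Session) -> str:
    """Run one conversation turn through both layers around the model call."""
    if pre_check(prompt, session):
        return "BLOCKED_PRE"
    output = model(prompt)
    if post_check(output, session):
        return "BLOCKED_POST"
    return output
```

In this sketch the model is any callable from prompt to text, which keeps the middleware model-agnostic. An output flagged on one turn causes an otherwise innocuous follow-up prompt to be blocked before inference, which is one simple way a feedback loop could catch additional turns in a multi-turn attack.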
Melissa Pinkston
Constructing Excellence
www.synapsesocial.com/papers/69b25b2b96eeacc4fcec9a1c — DOI: https://doi.org/10.5281/zenodo.18940958