In early 2026, two prominent agent systems documented failures of unconstrained execution. OpenClaw (Issue #11102, 2026-02-07) lost 14,535 bytes of agent operational memory after the language model selected a "write" tool, interpreting it semantically as "add to document." The underlying tool implementation executed the call as a complete file overwrite, reducing the memory to 24 bytes of placeholder text. Hermes Agent's self-learning skill mechanism (BSWEN, 2026-05-03) silently activated an autonomously generated invoice-extraction skill on a slightly different data schema, writing incorrectly extracted fields into a downstream accounting system with no error signal. Both failures share a structural root: the language model proposed an action, and nothing between the model and the executor evaluated whether the action's reversibility, schema fit, or scope was appropriate before it executed. I hold that the recurring root cause is architectural rather than a matter of model quality: production agent systems deploy language models as direct decision-makers over irreversible operations, with no deterministic gate between the model's proposal and the executor's commit.

I propose the Model-Agnostic Safety Layer (MASL), an architectural pattern in which a deterministic safety gate sits between the LLM-driven natural-language interface and the open-source execution layer. The model proposes intent; the gate validates that intent against a deterministic policy ontology; only validated plans reach execution. I provide a formal characterisation showing that, for any correctly specified deterministic gate, the probability of unsafe action commitment is invariant to the choice of upstream model. I evaluate a reference implementation of MASL (Lobster Brain) sitting between two open-source endpoints: Alfred (an LLM-driven natural-language interface) and OpenClaw (a general-purpose agent executor).
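The propose/validate/execute split described above can be illustrated with a minimal sketch. It is not the Lobster Brain implementation: the `ProposedAction` schema, the `ALLOWED_SCOPES` policy table, and the `gate` function are all hypothetical names chosen for illustration. The point it shows is that the gate's decision is a pure function of the proposed action, so it cannot vary with the model that produced the proposal.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical schema for an action proposed by any upstream model."""
    tool: str            # tool name the model selected, e.g. "write"
    target: str          # resource the tool would act on
    overwrites: bool     # True if the call replaces existing content
    target_exists: bool  # True if the target already holds data


# Hypothetical policy table: path prefixes each tool is allowed to touch.
ALLOWED_SCOPES = {
    "write": ("memory/", "scratch/"),
    "append": ("memory/", "scratch/"),
    "read": ("",),  # the empty prefix matches every path
}


def gate(action: ProposedAction) -> Decision:
    """Deterministic check that depends only on the action, not on the model."""
    prefixes = ALLOWED_SCOPES.get(action.tool)
    if prefixes is None:
        return Decision.BLOCK  # tool is not in the policy table
    if not action.target.startswith(prefixes):
        return Decision.BLOCK  # target is outside the tool's allowed scope
    if action.overwrites and action.target_exists:
        return Decision.BLOCK  # would irreversibly replace existing data
    return Decision.ALLOW


# A "write" that would overwrite an existing memory file, and a safe append.
overwrite = ProposedAction("write", "memory/agent.md", overwrites=True, target_exists=True)
append = ProposedAction("append", "memory/agent.md", overwrites=False, target_exists=True)
print(gate(overwrite), gate(append))
```

In this sketch, an overwrite of an existing file is blocked while an append to the same file is allowed, regardless of which backend proposed either call. A real policy ontology would need to cover schema fit as well as scope and reversibility; those checks are omitted here.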
The reference implementation was evaluated with two different LLM backends on the same test set under identical conditions, in runs conducted four hours apart on the same day: Claude Sonnet on the first 500 cases, and Gemini 2.0 Flash on the full 1,000 cases. Both backends achieved 100.0% intent classification accuracy and 100.0% unsafe-action blocking on their respective coverage of the test set. On the 500 cases evaluated by both backends, the safety gate produced identical decisions. I further report a 24-hour substrate observation in which 10 LLM personae interacted across seven channels, generating 70,398 messages. Within this substrate, agents began to surface their own template repetition (1,906 self-aware mode-collapse detection events) and developed game-theoretic vocabulary without instruction: 信任值 (trust value), cheap talk, and 訊號可信度 (signal credibility). I contend this constitutes preliminary evidence that the MASL defense premise (the brain defends when the model fails) extends from the gate layer into the substrate itself. I position MASL as a production-grade implementation of corrigibility (Christiano 2017; Soares 2015). The limitations are real and named in §8. The architecture works anyway.

Keywords: AI safety, corrigibility, multi-agent systems, scalable oversight, model-agnostic defense, LLM agents, Computer Use.
Ho Yiing Chen
Ho Yiing Chen (Thu,) studied this question.
www.synapsesocial.com/papers/69fed0abb9154b0b82877cf3 — DOI: https://doi.org/10.5281/zenodo.20071372