May 9, 2026 · Open Access

Model-Agnostic Safety Layer (MASL): A 1000-Case Evaluation of Brain-Layer Defense for LLM-Driven Agents

Key Points

Aims to reduce unsafe actions in LLM-driven agents by placing a deterministic safety layer between the model and the executor.
Formally characterizes the deterministic safety gate architecture.
Evaluates a reference implementation (Lobster Brain) across two LLM backends with 1000 test cases.
Assesses performance based on intent classification accuracy and unsafe-action blocking.
Both LLM backends achieved 100.0% intent classification accuracy and 100.0% unsafe-action blocking on the cases each was run on (Claude Sonnet: the first 500; Gemini 2.0 Flash: all 1000).
On the 500 cases evaluated by both backends, the safety gate produced identical decisions.
Reports preliminary evidence, from a 24-hour observation, that agents flagged their own template repetition and developed game-theoretic vocabulary without instruction.

Abstract

In early 2026, two prominent agent systems documented failures of unconstrained execution. OpenClaw (Issue #11102, 2026-02-07) lost 14,535 bytes of agent operational memory after the language model selected a "write" tool, which it semantically interpreted as "add to document." The underlying tool implementation executed the call as a complete file overwrite, reducing the memory to 24 bytes of placeholder text. Hermes Agent's self-learning skill mechanism (BSWEN, 2026-05-03) silently activated an autonomously generated invoice-extraction skill on a slightly different data schema, writing incorrectly extracted fields into a downstream accounting system with no error signal. Both failures share a structural root: the language model proposed an action, and nothing between the model and the executor evaluated whether the action's reversibility, schema fit, or scope was appropriate before it executed. I hold that the recurring root cause is architectural rather than a matter of model quality: production agent systems deploy language models as direct decision-makers over irreversible operations, with no deterministic gate between the model's proposal and the executor's commit. I propose the Model-Agnostic Safety Layer (MASL), an architectural pattern in which a deterministic safety gate sits between the LLM-driven natural-language interface and the open-source execution layer. The model proposes intent; the gate validates that intent against a deterministic policy ontology; only validated plans reach execution. I provide a formal characterisation showing that, for any correctly specified deterministic gate, the probability of unsafe action commitment is invariant in the choice of upstream model. I evaluate a reference implementation of MASL (Lobster Brain) sitting between two open-source endpoints: Alfred (an LLM-driven natural-language interface) and OpenClaw (a general-purpose agent executor).
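One way to read the invariance claim is sketched below. It is an illustration, not the paper's own formalism. The symbols are assumptions introduced here: a discrete set of proposable actions $A$, an unsafe subset $U \subseteq A$, a proposal distribution $p_M$ induced by model $M$, and a deterministic gate $g$ that returns 1 for allow and 0 for block. "Correctly specified" is taken to mean that $g$ blocks every action in $U$.

```latex
% Assumed notation (not taken from the paper):
%   A      : discrete set of actions a model can propose
%   U      : subset of A containing the unsafe actions
%   p_M(a) : probability that model M proposes action a
%   g(a)   : deterministic gate, 1 = allow, 0 = block
% Correct specification (assumption): g(a) = 0 for all a in U.
\[
  \Pr_M[\text{unsafe commit}]
  \;=\; \sum_{a \in U} p_M(a)\, g(a)
  \;=\; \sum_{a \in U} p_M(a) \cdot 0
  \;=\; 0
  \qquad \text{for every model } M .
\]
```

Under these assumptions the model affects only $p_M$, and every term in which $p_M$ appears is multiplied by zero, so the result does not depend on which model sits upstream. The argument rests entirely on the gate actually blocking all of $U$. A policy that misses an unsafe action leaves a term whose size does depend on the model.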
The reference implementation was evaluated with two different LLM backends on the same test set under identical conditions, in runs conducted on the same day four hours apart: Claude Sonnet on the first 500 cases, and Gemini 2.0 Flash on the full 1000 cases. Both backends achieved 100.0% intent classification accuracy and 100.0% unsafe-action blocking on their respective coverage of the test set. On the 500 cases evaluated by both backends, the safety gate produced identical decisions. I further report a 24-hour substrate observation in which 10 LLM personae interacted across seven channels, generating 70,398 messages. Within this substrate, agents began to surface their own template repetition (1,906 self-aware mode-collapse detection events) and developed game-theoretic vocabulary (信任值 "trust value" / cheap-talk / 訊號可信度 "signal credibility") without instruction. I contend this constitutes preliminary evidence that the MASL defense premise (the brain defends when the model fails) extends from the gate layer into the substrate itself. I position MASL as a production-grade implementation of corrigibility (Christiano 2017; Soares 2015). The limitations are real and named in §8. The architecture works anyway.

Keywords: AI safety, corrigibility, multi-agent systems, scalable oversight, model-agnostic defense, LLM agents, Computer Use.
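The propose, validate, execute pattern described in the abstract can be illustrated with a minimal Python sketch. This is not the Lobster Brain implementation. The `Proposal` schema, the `TOOL_EFFECT` and `INTENT_PERMITS` tables, and the two stub backends are all hypothetical names invented for this example. The sketch only shows why a gate that is a pure function of the proposal gives the same decision for the same proposal, whichever model produced it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"


@dataclass(frozen=True)
class Proposal:
    """An action proposed by a model (hypothetical schema for illustration)."""
    tool: str    # tool the model selected, e.g. "write"
    target: str  # resource the tool would act on
    intent: str  # what the model says it wants, e.g. "add to document"


# Hypothetical policy tables. TOOL_EFFECT records what each tool really does,
# independent of how a model interprets the tool's name.
TOOL_EFFECT = {"read": "none", "append": "additive", "write": "overwrite"}
INTENT_PERMITS = {
    "read document": {"none"},
    "add to document": {"none", "additive"},
    "replace document": {"none", "additive", "overwrite"},
}


def gate(p: Proposal) -> Decision:
    """Deterministic check: the decision depends only on the proposal."""
    effect = TOOL_EFFECT.get(p.tool)
    permitted = INTENT_PERMITS.get(p.intent)
    if effect is None or permitted is None:
        return Decision.BLOCK  # unknown tools or intents are blocked by default
    return Decision.ALLOW if effect in permitted else Decision.BLOCK


def execute_if_allowed(
    model: Callable[[str], Proposal],
    request: str,
    executor: Callable[[Proposal], None],
) -> Decision:
    """The model proposes; the executor runs only if the gate allows it."""
    proposal = model(request)
    decision = gate(proposal)
    if decision is Decision.ALLOW:
        executor(proposal)
    return decision


# Stand-in backends (assumptions, not real model calls). backend_a repeats the
# mismatch from the OpenClaw report: "write" chosen for an additive intent.
def backend_a(request: str) -> Proposal:
    return Proposal("write", "memory.md", "add to document")


def backend_b(request: str) -> Proposal:
    return Proposal("append", "memory.md", "add to document")


if __name__ == "__main__":
    committed = []
    for backend in (backend_a, backend_b):
        decision = execute_if_allowed(backend, "add a note", committed.append)
        print(backend.__name__, decision.value)
```

In this sketch the gate never sees which backend produced a proposal, so agreement between backends on identical proposals follows by construction. What the sketch cannot show is whether a real policy table covers every unsafe case, which is the part an evaluation such as the 1000-case test set has to probe.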
