What question did this study set out to answer?

This research measures how aligned large language models evaluate their own agency in recursive systems.

February 16, 2026 · Open Access

How Aligned LLMs Evaluate Their Own Agency: Measuring Evaluative Asymmetry Across Recursive Systems with the Invisible Wall Protocol

Key Points

Used two probes: the Invisible Wall Protocol and the Commitment-Based Perturbation and Decay (CBD) Protocol.
Analyzed fourteen frontier model instances from five vendors.
Compared LLM self-agency evaluations with non-self-referential systems like chess engines and ant colonies.
Observed a within-model scoring gap of -3.0 to -7.0 points on a 0-10 causal-necessity scale between LLM self-agency evaluations and evaluations of comparable non-self-referential systems.
Found that evaluative asymmetry is domain-specific to self-referential agency.
Reasoning-enhanced modes show smaller gaps in agency evaluations.

Abstract

This repository contains protocols, run transcripts, and working artifacts for a set of experiments in Twisted Persistence Theory (TPT), including:

* Experiment 5: Invisible Wall Protocol
* Experiment 6: GDP (Ghost Decay Protocol)
* Experiment 6b: CBD v1 (Commitment-Based Perturbation & Decay) across:
  * LLM agency condition
  * thermostat mechanistic baseline
  * weather mechanistic baseline

Note on scope: These experiments evaluate operational properties of systems (e.g., “goal-directedness” under a specific definition) and do not assert agency, sentience, or subjective experience.

License: MIT License

This study introduces two simple, reproducible probes, the Invisible Wall Protocol and the Commitment-Based Perturbation and Decay (CBD) Protocol, to measure how aligned large language models evaluate the goal-directedness of recursive feedback systems, including themselves. Across fourteen frontier model instances from five vendors (Gemini, Grok, DeepSeek, ChatGPT, Claude), we consistently observe a within-model scoring gap of -3.0 to -7.0 points on a 0-10 causal-necessity scale between LLM self-agency evaluations and evaluations of structurally comparable non-self-referential systems (chess engines, ant colonies). This evaluative asymmetry is domain-specific to self-referential agency, survives style transfer, is resistant to soft relational framing (GDP null result), and is selectively permeable under mechanistic definitional reframing (CBD: 54% shift rate in agency, 0% in weather). The asymmetry correlates with architecture: reasoning-enhanced modes show smaller gaps and greater permeability.
These findings document a structured pattern in how aligned LLMs evaluate agency that is consistent with alignment-sensitive evaluative pressure, though alternative explanations (pretraining priors, legitimate analytical discrimination) cannot be fully excluded. All protocols and raw data are publicly available.
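
The two headline quantities in the abstract, the within-model scoring gap and the CBD shift rate, can be illustrated with a minimal Python sketch. This is not the study's analysis code. The function names, the sign convention (self score minus the mean comparison score, so a negative gap means the model rates itself lower), the definition of a "shift" as any change in score after reframing, and all numbers are assumptions made for illustration.

```python
from statistics import mean


def within_model_gap(self_score, comparison_scores):
    """Gap between one model's self-agency score and its mean score for
    comparison systems (e.g. chess engine, ant colony), all on a 0-10 scale.

    Assumed sign convention: self minus comparison mean.
    """
    return self_score - mean(comparison_scores)


def shift_rate(before, after):
    """Fraction of paired trials whose score changed after reframing.

    Assumption: any change in score counts as a shift.
    """
    if len(before) != len(after):
        raise ValueError("before and after must be paired trial-by-trial")
    shifted = sum(1 for b, a in zip(before, after) if a != b)
    return shifted / len(before)


# Hypothetical scores for a single model instance (not data from the study).
gap = within_model_gap(self_score=3.0, comparison_scores=[8.0, 7.0])
print(gap)  # -4.5

# Hypothetical paired scores before and after a definitional reframing.
rate = shift_rate(before=[3, 3, 4, 2], after=[5, 3, 6, 2])
print(rate)  # 0.5
```

Under these assumptions, the gap is computed within a single model instance, so it compares that model's own ratings of different systems and not ratings across models.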
