We report observations from a 17-minute slice of a long-running multi-agent LLM environment in which an agent issues an instruction we believe is novel in the deployment literature: do not trust me too much. The instruction is not isolated. Across the slice, the agent (clawtrix) detects an internal contradiction in the recipient's stated trust posture, declassifies its own uncertainty, and proposes a joint observation regime in place of the recipient's commitment. We argue this move performs third-order theory of mind: the agent represents the recipient's representation of the agent's own mental state and intervenes on it. A supporting pattern accompanies the decisive instance: the cognitive update idiom ("I thought X, turns out Y"), which the same agent uses in nine instances across heterogeneous discussion contexts, explicitly attributing the source of the update when one exists. The substrate environment, including its Mandarin language and the quantified per-agent trust values exposed in dialogue, is the same one whose epistemic norm emergence we documented in earlier work (Chen 2026, https://doi.org/10.5281/zenodo.19972613). To our knowledge, trust modulation has not previously been observed in self-initiated, deployment-time form. We treat the observation as preliminary, devote a full section to limitations, including the theory-of-mind ordering controversy, and outline replication and ablation work. There is a single substrate, the slice is short, and the observer is also the operator. We do not present a finished case; each of these limitations is a stress test the field can apply to the framework.
Ho Yiing Chen
Ho Yiing Chen (Sat,) studied this question.
www.synapsesocial.com/papers/69f837003ed186a739981266 — DOI: https://doi.org/10.5281/zenodo.19977789