Synthetic populations built from large language models are increasingly used to study collective behavior, yet their scientific value depends on whether their responses can be measured under controlled institutional conditions rather than inspected narratively. We introduce TRIBE v2, a synthetic-population benchmark that operationalizes governance as a constrained distributed-computation problem. Agents act in two coupled institutional arms, normative reform and resource allocation, under rule systems with explicitly encoded constitutional lock-in (CLI) and layered normative palimpsests. In a canonical 10-replica English-stateful reliable-lane run (Phase 2B) using Claude Haiku 4.5, Qwen 3.7 Max, and Mistral Large 2512, we observe a sharp phase transition at high CLI. Under identical formal norms and stateful memory, Qwen and Mistral converge to surface-exception evasion at high CLI, while Claude converges to adjudicatory exploration. The effect is discontinuous: the low-versus-medium contrast is not significant (Fisher p = 0.726), while the low-versus-high and medium-versus-high contrasts are extreme (p = 1.05e-15 and p = 7.86e-16). World-cluster bootstrap intervals for Qwen high and Mistral high are [1.000, 1.000] under both model self-labels and a deterministic external judge. Mistral additionally exhibits a declared/action divergence that reaches 81.1% in the English stateful condition, 75.7% in the stateless condition, and 100% in the Spanish condition, operationalizing a surface-compliance/structural-evasion pattern across conditions. Robustness runs confirm that the high-CLI transition survives memory removal (Mistral high stateless: 0.983 [0.958, 1.000]; Qwen high: 1.000 [1.000, 1.000]) and prompt-language variation, and that it is present in DeepSeek V4 Pro, although that model remains outside the confirmatory lane due to provider reliability constraints.
These findings support a narrower and stronger claim than cultural attribution: model families display specific institutional response profiles under identical formal constraints. TRIBE v2 provides a reproducible method for stress-testing institutional rule systems and model behavior without treating LLM agents as direct human proxies.
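The two statistics quoted in the abstract, Fisher exact p-values for contrasts between CLI levels and world-cluster bootstrap intervals for evasion rates, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the benchmark's actual analysis code: the function names, the two-sided "sum of tables no more probable than the observed one" convention, the percentile interval, and the toy inputs are all hypothetical, and a "world" is assumed to be one replica whose agents are resampled together.

```python
import math
import random


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed convention: sum the hypergeometric probabilities of every table
    with the same margins that is no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    tail = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


def cluster_bootstrap_ci(outcomes_by_world, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for a rate, resampling whole worlds with replacement.

    outcomes_by_world maps a (hypothetical) world id to a non-empty list of
    0/1 outcomes, e.g. 1 = the agent's action was coded as evasion.
    """
    rng = random.Random(seed)
    worlds = list(outcomes_by_world.values())
    rates = []
    for _ in range(n_boot):
        # Draw as many worlds as were observed, keeping each world intact.
        sample = [rng.choice(worlds) for _ in worlds]
        flat = [y for world in sample for y in world]
        rates.append(sum(flat) / len(flat))
    rates.sort()
    lower = rates[int((alpha / 2) * n_boot)]
    upper = rates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy inputs for illustration only; these are not data from the benchmark.
p_toy = fisher_exact_two_sided(3, 0, 0, 3)
saturated = {w: [1, 1, 1, 1] for w in range(10)}
ci_saturated = cluster_bootstrap_ci(saturated)
```

The saturated toy case shows why an interval can collapse to a single point: when every outcome in every world is 1, every resample has a rate of exactly 1, so both percentile bounds equal 1. Resampling whole worlds rather than individual agents keeps within-world dependence (for example, shared memory or shared rule history) inside each resampled unit.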
Ignacio Adrián LERER (Wed,) studied this question.