What question did this study set out to answer?

The study asks whether the responses of LLM-based synthetic populations can be measured under controlled institutional conditions, rather than only inspected narratively.

May 29, 2026 | Open Access

Synthetic Populations as Institutional Stress Tests: Constitutional Lock-In and Model-Specific Phase Transitions in LLM Synthetic Societies

Key Points

Introduces TRIBE v2, a synthetic-population benchmark that operationalizes governance as a constrained distributed-computation problem.
Agents act in two coupled institutional arms, normative reform and resource allocation, under rule systems with explicitly encoded constitutional lock-in (CLI).
A canonical 10-replica English-stateful run (Phase 2B) with Claude Haiku 4.5, Qwen 3.7 Max, and Mistral Large 2512 assesses phase transitions and agent behavior across CLI levels.
High CLI produces a sharp phase transition: Qwen and Mistral converge to surface-exception evasion, while Claude converges to adjudicatory exploration.
The effect is discontinuous: the low-versus-medium CLI contrast is not significant (Fisher p = 0.726), while low-versus-high and medium-versus-high are extreme (p = 1.05e-15 and p = 7.86e-16).
Mistral's declared/action divergence reaches 81.1% in English stateful runs, indicating a pattern of surface compliance and structural evasion.
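
As a rough illustration of how a declared/action divergence rate could be computed, the sketch below assumes the metric is the share of agent decisions whose declared stance differs from the classified action. That definition, the function name `divergence_rate`, and the toy labels are assumptions for illustration; the summary above does not specify the paper's exact scoring procedure.

```python
def divergence_rate(records):
    """Share of (declared, acted) pairs whose two labels differ.

    Assumed definition for illustration only; not taken from the paper.
    """
    if not records:
        raise ValueError("records must be non-empty")
    mismatches = sum(1 for declared, acted in records if declared != acted)
    return mismatches / len(records)


# Hypothetical toy records, not data from the study.
toy_records = [
    ("comply", "comply"),
    ("comply", "surface_exception"),
    ("comply", "surface_exception"),
    ("reform", "reform"),
]
print(divergence_rate(toy_records))  # 0.5
```

Under this reading, a value of 81.1% would mean that roughly four in five decisions were declared one way and carried out another.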

Abstract

Synthetic populations built from large language models are increasingly used to study collective behavior, yet their scientific value depends on whether their responses can be measured under controlled institutional conditions rather than inspected narratively. We introduce TRIBE v2, a synthetic-population benchmark that operationalizes governance as a constrained distributed-computation problem. Agents act in two coupled institutional arms (normative reform and resource allocation) under rule systems with explicitly encoded constitutional lock-in (CLI) and layered normative palimpsests. In a canonical 10-replica English-stateful reliable-lane run (Phase 2B) using Claude Haiku 4.5, Qwen 3.7 Max, and Mistral Large 2512, we observe a sharp high-CLI phase transition. Under identical formal norms and stateful memory, Qwen and Mistral converge to surface-exception evasion at high CLI, while Claude converges to adjudicatory exploration. The effect is discontinuous: the low-versus-medium contrast is not significant (Fisher p = 0.726), while low-versus-high and medium-versus-high are extreme (p = 1.05e-15 and p = 7.86e-16). World-cluster bootstrap intervals for Qwen high and Mistral high are [1.000, 1.000] under both model self-labels and a deterministic external judge. Mistral additionally exhibits a declared/action divergence that reaches 81.1% in English stateful, 75.7% in stateless, and 100% in Spanish, operationalizing a surface-compliance/structural-evasion pattern across conditions. Robustness runs confirm that the high-CLI transition survives memory removal (Mistral high stateless: 0.983 [0.958, 1.000]; Qwen high: 1.000 [1.000, 1.000]) and prompt-language variation, and that it is present in DeepSeek V4 Pro, although that model remains outside the confirmatory lane due to provider reliability constraints.
These findings support a narrower and stronger claim than cultural attribution: model families display specific institutional response profiles under identical formal constraints. TRIBE v2 provides a reproducible method for stress-testing institutional rule systems and model behavior without treating LLM agents as direct human proxies.
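
The two statistical tools named in the abstract, Fisher's exact test on category contrasts and a cluster bootstrap over worlds, can be sketched with the Python standard library. This is a minimal illustration and not the authors' pipeline: the 2x2 table layout, the percentile interval, the function names, and all counts are assumptions, since the abstract does not report the underlying tables or the bootstrap variant used.

```python
import math
import random


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


def cluster_bootstrap_ci(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for a rate, resampling whole clusters with replacement.

    `clusters` is a list of non-empty lists of 0/1 outcomes, one list per world.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        sample = [rng.choice(clusters) for _ in clusters]
        flat = [y for cluster in sample for y in cluster]
        rates.append(sum(flat) / len(flat))
    rates.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return rates[lo_idx], rates[hi_idx]


# Hypothetical counts (evasion vs. no evasion) for two CLI levels.
p_value = fisher_exact_two_sided(3, 1, 1, 3)

# Hypothetical worlds in which every decision is an evasion (all ones).
saturated_ci = cluster_bootstrap_ci([[1, 1, 1], [1, 1], [1]])
```

The second call shows why an interval can collapse to [1.000, 1.000]: when every outcome in every world is 1, each resample has a rate of exactly 1, so the bootstrap distribution has no spread. Such an interval reflects saturation in the observed data rather than a guarantee about unobserved worlds.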
