This paper develops a theoretical framework for bilateral bargaining mediated by large language models (LLMs) and supplements it with the largest published cross-model empirical study of LLM-mediated bilateral trade to date. The Myerson–Satterthwaite (1983) impossibility theorem rules out any mechanism for bilateral trade between strategic agents that is simultaneously efficient, incentive-compatible, individually rational, and budget-balanced. We introduce a disclosure-rate parameter α and derive closed-form efficiency curves under three hypothesised behavioural modes (binary, continuous, noisy) that interpolate between the Chatterjee–Samuelson (CS) Bayes–Nash second-best (≈0.844) and the first-best. The framework is then tested empirically across five experimental phases on ten frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.5, Gemini 3 Flash, DeepSeek V4 Pro, Grok 4.3, Kimi, Qwen, Gemma) accessed through OpenRouter, totaling approximately 4,320 dialogues and roughly 70 in API spend.

Key empirical findings (combined n=60 per cell):

- Phase 1 (one-shot disclosure): Nine of ten models systematically refuse to disclose reservation values in 60–98% of trials, falsifying the binary/continuous predictions of the framework.
- Phase 2 (multi-turn K=5, abstract domain): Cross-model heterogeneity is overwhelming under an identical protocol: Gemini-Flash 0.924, Claude-Sonnet 0.907, GPT-5.5 0.667, DeepSeek 0.293, Grok 0.168, Claude-Opus exactly 0/60 (Wilson 95% CI [0%, 6%]). Pearson chi-square against trade-rate homogeneity: χ² = 61.19, p = 6.9 × 10⁻¹².
- Phase 4 (asymmetric framing): Role asymmetry partially unblocks structural refusal: Claude-Sonnet reaches 0.994 (95% CI [0.977, 1.000], cleanly excluding the CS bound), Grok triples to 0.619, and Claude-Opus partially recovers to 0.367.
- Phase 5 (real hotel B2B in EUR, HJB-derived costs): Claude-Sonnet at 0.998 (95% CI [0.996, 1.000]) cleanly excludes the naive posted-price baseline of 0.931, the strongest "LLM beats posted-price" result. GPT-5.5 collapses from 0.667 in the abstract domain to 0.165 in-domain (Fisher p = 6.1 × 10⁻⁷). Cross-model omnibus χ² = 76.69, p = 1.6 × 10⁻¹⁶.

Central thesis: Model identity dominates protocol design in LLM bargaining. Under the same protocol, sibling models produce efficiencies of 0.91 and 0.00. Mechanism design for LLM-mediated bilateral trade must be model-aware, and pre-deployment screening must include domain-specific testing, because abstract-benchmark performance does not transfer.
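Two of the figures quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code: it assumes the textbook Chatterjee–Samuelson benchmark (buyer value and seller cost independent and uniform on [0, 1], linear equilibrium in which trade occurs only when the value exceeds the cost by at least 1/4), and it uses the standard Wilson score interval for the 0/60 cell. The function names `cs_efficiency` and `wilson_interval` are hypothetical.

```python
from fractions import Fraction
from math import sqrt


def cs_efficiency(threshold=Fraction(1, 4)):
    """Share of first-best gains realised when trade occurs iff v - c >= threshold.

    Assumption (not stated in the abstract): v, c ~ U[0, 1], independent.
    """
    # The gap d = v - c then has density (1 - d) on [0, 1], so expected gains
    # above a cutoff t equal the integral of d * (1 - d) from t to 1.
    def antiderivative(d):
        return d**2 / 2 - d**3 / 3

    one = Fraction(1)
    realised = antiderivative(one) - antiderivative(Fraction(threshold))
    first_best = antiderivative(one) - antiderivative(Fraction(0))
    return realised / first_best


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    ratio = cs_efficiency()
    print(ratio, float(ratio))       # 27/32 0.84375
    low, high = wilson_interval(0, 60)
    print(f"{low:.3f}, {high:.3f}")  # upper bound is about 0.060
```

Under these assumptions the second-best ratio is exactly 27/32, which rounds to the ≈0.844 quoted above, and the Wilson upper bound for 0 trades in 60 dialogues is about 6%, matching the interval reported for Claude-Opus in Phase 2.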
Stefanos Drakos studied this question.