May 29, 2026 · Open Access

Prompt Sensitivity in LLM-Based Market Agents

Key Points

The study asks whether LLM-driven micro-behavior can yield stable macro-level equilibria and, if so, how sensitive those equilibria are to prompt specification.
It uses hybrid LLM-based agents to examine market behavior under different prompt conditions.
The experiments vary narrative context (e.g., “Laundry Detergent” vs. “Fragrance”) and, in Experiment 3, add a verbal advantage cue for one brand.
Planned analyses include mixed-effects factorial ANOVA for entropy, AFT with right-censoring for τ, and aggregate Wilcoxon tests for the Double Jeopardy slope.
A core concern is whether market shares and equilibrium stability shift with how agents and brands are described in the prompt.
If macro-level outcomes fluctuate wildly between descriptors such as “practical” and “loyal,” no later intervention test is meaningful.
Experiments 1–2 test whether product differentiation is necessary for stable market structure; without it, any structure must arise from cognitive and informational factors.

Abstract

Background and Rationale: Agent-based models (ABMs) increasingly incorporate large language models (LLMs) to generate realistic agent behavior. Simultaneously, parametric models such as NBD-Dirichlet remain the gold standard for predicting aggregate market outcomes. A fundamental question arises: can LLM-driven micro-behavior yield stable macro-level equilibria, and if so, how sensitive is this equilibrium to prompt specification? The core theoretical motivation is reframing the coupling problem. Rather than seeking distributional consistency between LLM outputs and parametric models (which requires an intractable joint distribution), we propose interventional grounding: hybrid agents are ultimately valid if their response to interventions matches real-world effects.

Scope of this study: This preregistration does not directly test interventional grounding. Instead, it establishes preconditions needed for later interventional tests: whether LLM-based market agents converge to stable macro behavior and how sensitive that macro behavior is to prompt specification. If market shares fluctuate wildly based on whether an agent is described as “practical” or “loyal,” no intervention test is meaningful.

Interpretation / construct validity: We treat each LLM agent as a stochastic policy mapping (prompt + state) → distribution over actions (e.g., A, B, C). This is not a claim that the model instantiates human consumer cognition. Temperature is a decoding parameter controlling sampling stochasticity within the constrained action set; we use it as a controlled knob on policy randomness, not as a psychological construct such as “uncertainty.” Mechanistic inferences are therefore framed as prompt-to-policy-to-macro mappings rather than literal consumer psychology.

Sufficiency thesis and scope of inference: By construction, the baseline markets in Experiments 1–2 contain no objective product differentiation (brands are described as comparable in quality/features). Therefore, these experiments cannot falsify the claim that real-world market structure is grounded in product attributes. What they can test (and potentially falsify) is a different theoretical claim: that product differentiation is necessary for market structure (e.g., concentration, loyalty/polarisation, stable macro parameters) to emerge. In our setting, any such structure must arise from informational topology and cognition (social information, memory, heterogeneity, stochasticity) rather than physical superiority. Experiment 2’s “Laundry Detergent” vs. “Fragrance” manipulation should therefore be interpreted as a manipulation of category meaning / visibility / involvement (i.e., narrative context and priors), not as physical attribute differences. Experiment 3 then introduces an explicit verbal advantage cue in the LLM prompt (a textual statement that one brand is reported to perform better) to test how informational mechanisms behave when an asymmetric brand-performance signal is present in the prompt (not when objective physical attributes are present).

v1.0.0 (2026-05-07): Pre-publish revisions applied per 5-reviewer independent panel (ABM, LLM-evaluation, marketing-science, metascience, Codex). Path B decisions: D1 = replace Smartphone with Fragrance in the Conspicuous arm; D2 = dual-fit NBD-Dirichlet with method-of-moments + MLE and explicit success thresholds; D3a = keep-and-fix NBD-Dirichlet with Double Jeopardy slope elevated to confirmatory marketing-science outcome plus Sole-Choice Rate. See §21 Amendments Log in the canonical preregistration for the full revision list.

v1.0.0 (2026-05-10): Pre-publication adversarial-review pass applied. Codex deep-dive (session 019e127a-a4b7-77b1-945b-f5f7d2b084b3) + advisor synthesis surfaced 3 blocker-grade and 5 mid-severity issues; an external review of the summary surfaced 30 additional items (10 must-fix + 20 supporting). Key changes vs prior monolithic design:
(i) §10 / §6.10 restructured into a phase-gated ask: Phase 0 (construct validity + throughput benchmark) gates Phase 1 (Exp 1 confirmatory), which gates Phases 2–3 (Exp 2 + Exp 3, conditional on H1 support).
(ii) Exp 1c (ultra-scale finite-size scaling, ~1,792 runs, ~336 GPU-days) deferred to v2.0.0. v1.0.0 total: 14,730 runs / ~1.86B inference calls / ~215 H100-GPU-days under the working throughput assumption (Phase 0 benchmarks this).
(iii) Phase 0 construct-validity gate: B11 free-text ablation expanded from 10 to 30 runs (6 anchor cells × 5 reps) with pre-registered KL kill-switch at 0.10 nats; v1 sampling auto-amends to free-text+parser before Phase 1 commits if breached.
(iv) D3a Double Jeopardy slope kept as confirmatory (D-1b path, honoring the 2026-05-07 elevation) with per-run-instability caveat added at all confirmatory anchors: K=3 brand points give 1 residual DoF per per-run regression, so per-run slopes are noisy by construction and confirmatory interpretation rests on aggregate Wilcoxon over n=192 condition-medians.
(v) Serving topology locked: tensor-parallel TP=2 across 2× H100 80GB BF16 per replica (Llama-3.3-70B at BF16 ~140GB does not fit single 80GB H100); vLLM 0.7.x with constrained-decoding via outlines/guideddecoding pinned; TGI fallback runs reported separately.
(vi) §9.1 split by outcome: mixed-effects factorial ANOVA for entropy, AFT with right-censoring (lognormal default + AIC selection) for τ, aggregate Wilcoxon for DJ-slope.
(vii) Exp 3 made conditional on Phase 1 H1 support (was “still runs for completeness regardless”); v1/v2 staging discipline tightened.
(viii) Run-count totals reconciled across full prereg and summary (was 16,502 / 16,362 / 17,342, three internally inconsistent figures).
(ix) Constrained-logit tokenizer spec, balanced 6-permutation brand-label cycling, and full KL gate procedure (250k agent-step empirical, Laplace smoothing, JS auxiliary) pre-registered.
(x) Code repository, container image, and reproducibility script (scripts/render_summary_pdf.sh) committed.
(xi) Soft-hyphen artifacts in the summary PDF (Chrome auto-hyphenation rendering bug) eliminated via CSS hyphens: none.
See §21 Amendments Log in the canonical preregistration for the line-by-line audit trail (rows 2026-05-10).
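
The abstract's framing of an agent as a stochastic policy, with temperature acting only as a knob on sampling randomness over a constrained action set, can be illustrated with a minimal sketch. This is not the study's implementation: the logit values, the function names, and the use of a plain temperature-scaled softmax over three action logits are all assumptions made for illustration.

```python
import math
import random

ACTIONS = ("A", "B", "C")  # constrained action set, as named in the abstract


def policy_distribution(logits, temperature):
    """Temperature-scaled softmax restricted to the constrained action set."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logits[a] / temperature for a in ACTIONS]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {a: w / total for a, w in zip(ACTIONS, weights)}


def sample_action(logits, temperature, rng):
    """Draw one action from the policy for a given (prompt, state) pair."""
    probs = policy_distribution(logits, temperature)
    return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS], k=1)[0]


# Hypothetical logits for a single (prompt, state) pair; not taken from the study.
example_logits = {"A": 2.0, "B": 1.0, "C": 0.5}
cold = policy_distribution(example_logits, temperature=0.5)
hot = policy_distribution(example_logits, temperature=2.0)
```

In this sketch a lower temperature concentrates probability on the highest-logit brand and a higher temperature flattens the distribution, which is the sense in which temperature controls policy randomness rather than any psychological quantity.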

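The Phase 0 gate described in item (iii) compares action distributions against a KL threshold of 0.10 nats, with Laplace smoothing and a Jensen-Shannon auxiliary measure mentioned in item (ix). The sketch below shows one way such a check could be computed. The direction of the KL divergence (free-text relative to constrained), the smoothing constant of 1, the function names, and the toy action counts are assumptions; the canonical preregistration defines the actual procedure.

```python
import math
from collections import Counter

ACTIONS = ("A", "B", "C")
KL_THRESHOLD_NATS = 0.10  # kill-switch value stated in the abstract


def smoothed_distribution(actions, alpha=1.0):
    """Empirical action frequencies with Laplace (add-alpha) smoothing."""
    counts = Counter(actions)
    total = len(actions) + alpha * len(ACTIONS)
    return [(counts[a] + alpha) / total for a in ACTIONS]


def kl_nats(p, q):
    """KL(p || q) in nats (natural logarithm)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_nats(p, q):
    """Jensen-Shannon divergence in nats; symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_nats(p, m) + 0.5 * kl_nats(q, m)


def gate_breached(constrained_actions, free_text_actions):
    """Assumed direction: KL(free-text || constrained) against the threshold."""
    p = smoothed_distribution(free_text_actions)
    q = smoothed_distribution(constrained_actions)
    return kl_nats(p, q) > KL_THRESHOLD_NATS


# Toy action logs (hypothetical; the abstract specifies 250k agent-steps).
constrained = ["A"] * 500 + ["B"] * 300 + ["C"] * 200
free_similar = ["A"] * 480 + ["B"] * 310 + ["C"] * 210
free_skewed = ["A"] * 900 + ["B"] * 50 + ["C"] * 50
```

With these toy logs, `gate_breached(constrained, free_similar)` stays below the threshold while `gate_breached(constrained, free_skewed)` exceeds it. Smoothing keeps the divergence finite when one sampling mode never selects a brand that the other does.
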
Authors

Massimiliano Marinucci

Link Consulting

Institutions

Link Consulting
