What question did this study set out to answer?

The study evaluates how stable the behavior of large language models is across repeated runs of role-playing simulations, using the Political Compass questionnaire as the testbed.

February 20, 2026 · Open Access

Before You Simulate: A Pre-Study Benchmark for Large Language Model Stability in Political Role-Playing Simulations

Key Points

Develops a behavioral stability evaluation framework for role-playing tasks, with the Political Compass questionnaire as the testbed.
Constructs personas from publicly available social media texts and stratifies them by Political Signal Clarity.
Compares repeated questionnaire completions across three LLMs under different decoding temperatures and prompting strategies.
Coordinate drift and item-level dispersion do not always move in tandem.
Increasing temperature typically amplifies variability, although models differ in their sensitivity.
Chain-of-Thought prompting did not enhance stability and frequently amplified coordinate drift.

Abstract

As large language models (LLMs) are increasingly used as digital respondents and generative agents in computational social science, prior work has primarily focused on the fidelity of their expressed opinions, often overlooking a fundamental question: the behavioral stability of outputs across repeated runs of the same model when the persona specification and task conditions remain unchanged. This paper proposes a behavioral stability evaluation framework for role-playing tasks, using the Political Compass questionnaire as the testbed. The questionnaire maps responses onto a two-dimensional coordinate system defined by an economic axis and a social axis, enabling political orientations to be directly quantified and compared in a continuous space. To ground the simulation in realistic user behaviors, we construct personas from publicly available social media texts and stratify them based on Political Signal Clarity. Across three LLMs, we compare repeated questionnaire completions under different decoding temperatures and prompting strategies. We characterize stability along two complementary dimensions: dispersion of the resulting two-dimensional coordinates across runs, measured by an Overall Stability Score (OSS), and dispersion of per-item choices across runs, quantified by response entropy. We further use linear mixed-effects models to account for persona-level heterogeneity and to estimate the effects of key factors on stability. Our results show that coordinate drift and item-level dispersion do not always move in tandem. Increasing temperature typically amplifies variability, although models differ in their sensitivity. Contrary to its success in reasoning tasks, Chain-of-Thought (CoT) prompting failed to enhance stability in this value-laden context. Instead, it frequently amplified coordinate drift by introducing stochasticity into intermediate reasoning steps. The results also show that LLMs exhibit greater behavioral stability when role-playing personas with clearer political signals. These findings suggest that stability should be treated as a pre-study benchmark before deploying LLM-based role-playing simulations, and that key generation settings and stability statistics should be reported alongside substantive conclusions.
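The two stability dimensions named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the formula for the Overall Stability Score, so `coordinate_dispersion` below substitutes an assumed stand-in (mean Euclidean distance of each run from the centroid), and `response_entropy` assumes plain Shannon entropy in bits over the answers one item receives across runs. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter
from statistics import mean


def response_entropy(choices):
    """Shannon entropy (bits) of the answers one item received across runs.

    Assumption: plain Shannon entropy; the paper may normalize differently.
    """
    counts = Counter(choices)
    n = len(choices)
    # p * log2(1/p) with p = c / n, summed over the distinct answers.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def coordinate_dispersion(points):
    """Mean Euclidean distance of each run's (economic, social) point
    from the centroid of all runs.

    Assumption: a stand-in for coordinate drift, NOT the paper's OSS.
    """
    cx = mean(p[0] for p in points)
    cy = mean(p[1] for p in points)
    return mean(math.hypot(x - cx, y - cy) for x, y in points)


# Hypothetical data: four repeated runs of one persona under fixed settings.
item_answers = ["agree", "disagree", "agree", "disagree"]
run_coordinates = [(-2.0, 1.0), (-2.0, 1.0), (-2.0, 1.0), (-2.0, 1.0)]

print(response_entropy(item_answers))          # item-level dispersion
print(coordinate_dispersion(run_coordinates))  # coordinate-level dispersion
```

In this made-up data the item answers flip between runs while the coordinates stay fixed, which is one way the two measures can diverge: changes on individual items may offset each other once they are aggregated into axis scores.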

Cite This Study

Shen et al. studied this question.

synapsesocial.com/papers/6997f9c9ad1d9b11b34528fe https://doi.org/10.3390/app16042027
