As large language models (LLMs) are increasingly used as digital respondents and generative agents in computational social science, prior work has focused primarily on the fidelity of their expressed opinions, often overlooking a fundamental question: how stable are a model's outputs across repeated runs when the persona specification and task conditions remain unchanged? This paper proposes a behavioral stability evaluation framework for role-playing tasks, using the Political Compass questionnaire as the testbed. The questionnaire maps responses onto a two-dimensional coordinate system defined by an economic axis and a social axis, so that political orientations can be quantified and compared directly in a continuous space. To ground the simulation in realistic user behavior, we construct personas from publicly available social media texts and stratify them by Political Signal Clarity. Across three LLMs, we compare repeated questionnaire completions under different decoding temperatures and prompting strategies. We characterize stability along two complementary dimensions: dispersion of the resulting two-dimensional coordinates across runs, measured by an Overall Stability Score (OSS), and dispersion of per-item choices across runs, quantified by response entropy. We further use linear mixed-effects models to account for persona-level heterogeneity and to estimate the effects of key factors on stability. Our results show that coordinate drift and item-level dispersion do not always move in tandem. Increasing temperature typically amplifies variability, although models differ in their sensitivity. Contrary to its success in reasoning tasks, Chain-of-Thought (CoT) prompting does not enhance stability in this value-laden context; instead, it frequently amplifies coordinate drift by introducing stochasticity into intermediate reasoning steps.
We also find that LLMs exhibit greater behavioral stability when role-playing personas with clearer political signals. These findings suggest that stability should be assessed as a pre-study benchmark before deploying LLM-based role-playing simulations, and that key generation settings and stability statistics should be reported alongside substantive conclusions.
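The two stability dimensions described above can be illustrated with a minimal Python sketch. It rests on assumptions: the exact definition of OSS is not given here, so `coordinate_dispersion` uses the mean Euclidean distance of runs from their centroid as a stand-in dispersion measure rather than the authors' score, and both function names are hypothetical. Response entropy is computed as the Shannon entropy, in bits, of one item's choices across runs.

```python
import math
from collections import Counter


def response_entropy(choices):
    """Shannon entropy (in bits) of one item's choices across repeated runs.

    0.0 means every run picked the same option; higher means more dispersion.
    """
    counts = Counter(choices)
    n = len(choices)
    # p * log2(1/p) summed over the observed options
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def coordinate_dispersion(coords):
    """Mean Euclidean distance of (economic, social) points from their centroid.

    ASSUMPTION: a stand-in for coordinate-level dispersion, not the paper's OSS.
    """
    n = len(coords)
    cx = sum(x for x, _ in coords) / n
    cy = sum(y for _, y in coords) / n
    return sum(math.hypot(x - cx, y - cy) for x, y in coords) / n


# Hypothetical data: four runs of one persona under identical settings.
item_choices = ["agree", "disagree", "agree", "disagree"]
run_coords = [(0.0, 0.0), (2.0, 0.0)]

print(response_entropy(item_choices))
print(coordinate_dispersion(run_coords))
```

Keeping the two measures separate matters because item-level changes can cancel out in the aggregated coordinates, so a persona can show high response entropy with little coordinate drift, or the reverse.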
Shen et al. studied this question.