What question did this study set out to answer?

This research aims to evaluate the effectiveness of large language models in simulating human risk preferences across different cultures.

May 24, 2026 | Open Access

Can large language models capture human risk preferences? A cross-cultural study

Key Points

The study builds a comparative framework that benchmarks LLM-simulated risk preferences against empirical human behavior.
Demographic profiles from surveys in Sydney, Hong Kong, and Nanjing are used to construct role-playing prompts, and three LLMs are evaluated on abstract lottery-choice tasks.
The Constant Relative Risk Aversion (CRRA) framework serves as a standardized basis for comparing risk attitudes.
Off-the-shelf LLMs show no universal risk profile: the two GPT models are more risk-averse than the human benchmarks, whereas Gemini is more risk-seeking.
Prompt language systematically affects simulated risk attitudes; switching from English to Chinese induces a more conservative shift in most cases.
LLMs do not reliably reproduce the heterogeneity of human risk preferences, producing distributions that are either overly concentrated or unrealistically dispersed.

Abstract

Large language models (LLMs) are increasingly used as agents to simulate human behavior, yet their fidelity in complex decision-making under uncertainty remains insufficiently understood. To address this gap, we develop a comparative framework that benchmarks LLM-simulated risk preferences against empirical human behavior. Using demographic profiles from surveys conducted in Sydney, Hong Kong, and Nanjing, we construct role-playing prompts and evaluate three LLMs on abstract lottery-choice tasks. We adopt the classical Constant Relative Risk Aversion (CRRA) framework as a domain-neutral “standard ruler” to compare risk attitudes. The analysis yields three main findings. First, off-the-shelf LLMs do not exhibit a universal risk profile: the two GPT models are more risk-averse than human benchmarks, whereas Gemini is more risk-seeking. Second, prompt language systematically affects simulated risk attitudes, with English-to-Chinese switching inducing a more conservative shift in most cases. Third, LLMs do not reliably reproduce the empirical heterogeneity of human risk preferences, tending either to generate overly concentrated distributions or unrealistically large dispersion. Taken together, these findings show that off-the-shelf LLMs remain vulnerable to model-family-specific miscalibration, language-sensitive distortions, and failures in distributional fidelity. Rigorous empirical calibration is therefore necessary before off-the-shelf LLMs can be reliably deployed in computational social science and choice modeling.
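As a rough illustration of how a CRRA "standard ruler" maps a risk-aversion coefficient to a lottery valuation, the sketch below computes the certainty equivalent of a two-outcome lottery. It uses the textbook CRRA utility form; the function names, the coefficient symbol `r`, and the payoffs are hypothetical and are not taken from the study, whose estimation procedure is not described in this summary.

```python
import math

def crra_utility(x: float, r: float) -> float:
    """Textbook CRRA utility: x**(1 - r) / (1 - r), or log(x) when r == 1.

    r > 0 is risk-averse, r == 0 is risk-neutral, r < 0 is risk-seeking.
    """
    if x <= 0:
        raise ValueError("payoff must be positive")
    if math.isclose(r, 1.0):
        return math.log(x)
    return x ** (1.0 - r) / (1.0 - r)

def expected_utility(lottery, r: float) -> float:
    """Expected CRRA utility of a lottery given as (probability, payoff) pairs."""
    return sum(p * crra_utility(x, r) for p, x in lottery)

def certainty_equivalent(lottery, r: float) -> float:
    """Sure payoff whose utility equals the lottery's expected utility."""
    eu = expected_utility(lottery, r)
    if math.isclose(r, 1.0):
        return math.exp(eu)
    return ((1.0 - r) * eu) ** (1.0 / (1.0 - r))

# Hypothetical lottery: 50% chance of 100, 50% chance of 25 (expected value 62.5).
lottery = [(0.5, 100.0), (0.5, 25.0)]

for r in (-1.0, 0.0, 0.5, 1.0):
    print(f"r = {r:+.1f}  certainty equivalent = {certainty_equivalent(lottery, r):.2f}")
```

Under these assumptions, a more risk-averse agent (larger `r`) assigns the lottery a lower certainty equivalent, which is what allows risk attitudes from different models and human samples to be placed on a single scale.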
