Can large language models (LLMs) simulate participant-level datasets from experimental designs so that the statistical properties of the simulated data, such as effect directions, magnitudes, and significance, align with those of actual human data? In this work, we tested whether LLMs can generate simulated datasets that reproduce the core findings of real randomized controlled trials (RCTs) using only the information provided in a study’s pre-registration. We assessed whether this alignment generalizes across different LLMs (ChatGPT, Gemini, Perplexity) and across distinct experimental domains, including a math reasoning task comparing student performance and a social judgment task. We found that LLM-simulated datasets mirrored the real data in effect direction and recovered the original patterns of statistical significance. While LLMs cannot replace empirical studies, our study suggests that they offer a powerful and flexible complement capable of accelerating idea testing, refining study designs, and probing the robustness of research findings before real-world experiments are conducted.
Lukumon et al. studied this question.