Large language models (LLMs) hold great promise for generating social science data, potentially expanding the methodological toolkit of quantitative social research. Prior studies have primarily focused on individual-level predictability or behavioral plausibility of LLM-generated data. We propose a framework for assessing the validity of LLM-generated data by returning to the foundational principles of survey research in the social sciences. Just as surveys based on representative samples yield statistics that approximate the corresponding statistical moments of the target population, assessment should center on the ability of LLM-generated data to reproduce real-world, population-level statistical patterns. We introduce SSDataBench, a systematic benchmark designed to evaluate population-level statistical realism in LLM-generated social science data. The benchmark assesses five types of statistical patterns central to social research: univariate distributions, bivariate associations, multivariate outcome predictions, life event sequence distributions, and associations between life event sequences and covariates. We illustrate SSDataBench using four longitudinal datasets and three cross-sectional datasets spanning six major social domains: demographics, socioeconomic status, marriage, health, abilities, and attitudes. Our study reveals representational limitations in current LLMs under sparse conditioning settings, manifested in a pronounced tendency to compress real-world heterogeneity into simplified typological structures. Finally, we outline a roadmap toward improved statistical realism and report preliminary results indicating that domain-specific training can enhance population-level realism.
Xie et al. studied this question.
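To make the notion of population-level statistical realism concrete, the following is a minimal sketch of how the first two pattern types named above (univariate distributions and bivariate associations) might be compared between a real sample and an LLM-generated sample. The metrics used here (total variation distance and the absolute gap in Pearson correlation), the function names, and the toy values are illustrative assumptions; they are not taken from SSDataBench and should not be read as the benchmark's actual metrics or data.

```python
from collections import Counter
from math import sqrt


def total_variation_distance(real, synthetic):
    """Half the L1 distance between two empirical categorical distributions.

    Assumed metric for illustration: 0 means identical marginals,
    1 means the two samples share no categories.
    """
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def association_gap(real_x, real_y, syn_x, syn_y):
    """Absolute difference between the real and synthetic correlations."""
    return abs(pearson_r(real_x, real_y) - pearson_r(syn_x, syn_y))


# Hypothetical toy values, invented purely to exercise the functions.
real_marital = ["married"] * 5 + ["single"] * 3 + ["divorced"] * 2
syn_marital = ["married"] * 8 + ["single"] * 2  # no "divorced" category

real_edu, real_income = [1, 2, 3, 4], [2, 4, 5, 9]
syn_edu, syn_income = [1, 2, 3, 4], [2, 4, 6, 8]  # perfectly linear

tvd = total_variation_distance(real_marital, syn_marital)
gap = association_gap(real_edu, real_income, syn_edu, syn_income)
print(f"univariate TVD: {tvd:.3f}")
print(f"bivariate correlation gap: {gap:.3f}")
```

In this toy case the synthetic sample drops a category entirely and shows a perfectly linear association, a small-scale analogue of the compression of heterogeneity described above. Comparisons of this kind are made on aggregate statistics rather than on individual-level predictions.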