What question did this study set out to answer?

This research aims to assess the validity of social science data generated by large language models (LLMs) using a structured framework.

May 11, 2026

Evaluating the statistical realism of LLM-generated social science data.

Key Points

Developed SSDataBench to evaluate statistical realism in LLM-generated data.
Assessed five types of statistical patterns central to social research, using four longitudinal and three cross-sectional datasets.
Covered six major social domains: demographics, socioeconomic status, marriage, health, abilities, and attitudes.
Identified representational limitations in current LLMs under sparse conditioning, showing a tendency to compress real-world heterogeneity into simplified typologies.
Reported preliminary results indicating that domain-specific training can improve population-level statistical realism.
Demonstrated that LLM-generated data may not adequately reflect complex social patterns.

Abstract

Large language models (LLMs) hold great promise for generating social science data, potentially expanding the methodological toolkit of quantitative social research. Prior studies have primarily focused on individual-level predictability or behavioral plausibility of LLM-generated data. We propose a framework for assessing the validity of LLM-generated data by returning to the foundational principles of survey research in the social sciences. Just as surveys based on representative samples yield statistics that approximate the corresponding statistical moments of the target population, assessment should center on the ability of LLM-generated data to reproduce real-world, population-level statistical patterns. We introduce SSDataBench, a systematic benchmark designed to evaluate population-level statistical realism in LLM-generated social science data. The benchmark assesses five types of statistical patterns central to social research: univariate distributions, bivariate associations, multivariate outcome predictions, life event sequence distributions, and associations between life event sequences and covariates. We illustrate SSDataBench using four longitudinal datasets and three cross-sectional datasets spanning six major social domains: demographics, socioeconomic status, marriage, health, abilities, and attitudes. Our study reveals representational limitations in current LLMs under sparse conditioning settings, manifested in a pronounced tendency to compress real-world heterogeneity into simplified typological structures. Finally, we outline a roadmap toward improved statistical realism and report preliminary results indicating that domain-specific training can enhance population-level realism.
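To make the first pattern type (univariate distributions) concrete, here is a minimal sketch of how one might compare a categorical variable in survey data against its LLM-generated counterpart. The metric (total variation distance), the function names, and the toy data are illustrative assumptions; the abstract does not state which measures SSDataBench actually uses.

```python
from collections import Counter


def category_shares(values):
    """Return the share of each category in a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real, synthetic):
    """Half the summed absolute difference in category shares.

    0.0 means identical distributions; 1.0 means no overlap at all.
    This metric is an assumption for illustration, not the benchmark's own.
    """
    p = category_shares(real)
    q = category_shares(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical toy data: marital status in a survey vs. an LLM-generated sample.
real = ["married"] * 5 + ["single"] * 3 + ["divorced"] * 2
synthetic = ["married"] * 8 + ["single"] * 2

tvd = total_variation_distance(real, synthetic)
missing = set(real) - set(synthetic)
print(f"TVD = {tvd:.2f}; categories absent from synthetic data: {missing}")
```

In this toy case the synthetic sample over-represents the modal category and drops one category entirely, which is the kind of compressed heterogeneity the abstract describes at the population level.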
