National statistical agencies increasingly face budget constraints and shrinking sample sizes, while simultaneously gaining access to rich auxiliary data and powerful pre-trained machine learning (ML) and artificial intelligence (AI) models, including Large Language Models (LLMs). Traditional model-assisted estimation techniques, which fit models using survey sample data, are limited by small sample sizes, struggle to leverage complex non-linear relationships in auxiliary data, and cannot accommodate frontier pre-trained models. This work re-examines the use of pre-trained black-box models, fit independently of the survey sample, for design-based parameter estimation. Inspired by the Prediction-Powered Inference (PPI) framework, we introduce the Prediction-Powered Estimator (PPE), an unbiased estimator with an unbiased variance estimator for the survey design setting. We also formalize the use of pre-trained models with the classic difference estimator, which we term the Prediction-Powered Difference (PPD) estimator, and with the Generalized Regression Estimator using predicted values as covariates (GREG-ŷ). Through LLM-based use cases leveraging unstructured auxiliary data (images and text) and experiments with real-world survey data from Statistics Canada, complemented by simulation studies in the Supplemental Material, we demonstrate that these approaches consistently outperform standard baseline estimators across bias, mean absolute error, mean squared error, coverage, and confidence interval width. The results suggest that pre-trained models can yield more accurate and efficient estimates while potentially reducing survey sample sizes and respondent burden, and motivate expanding the survey methodologist’s toolbox to include pre-trained models and novel auxiliary data sources.
Denis et al. (Tue,) studied this question.