What question did this study set out to answer?

The aim is to create a model that generates synthetic tabular data while ensuring privacy guarantees through differential privacy.

February 11, 2026 | Open Access

DP-Tabula: Differentially Private Synthetic Tabular Data Generation with Large Language Models

Key Points

Developed DP-Tabula, an LLM-based model for tabular data generation that incorporates differential privacy into its training process.
Employed an outlier handling technique to stabilize model performance under noisy training conditions.
Conducted experiments on multiple datasets to analyze privacy-utility trade-offs.
Found that the optimal noise level depends on dataset-specific characteristics.
Demonstrated the significant impact of feature input order on the quality of the generated synthetic data.

Abstract

Synthetic data generation has become a powerful solution for producing high-quality, privacy-preserving datasets, especially in domains where data sensitivity is crucial. Large Language Models (LLMs) have shown exceptional capabilities in natural language generation and, more recently, have demonstrated potential in generating tabular data. However, existing LLM-based approaches lack built-in privacy guarantees, making them susceptible to privacy breaches. To address this limitation, this work proposes DP-Tabula, an LLM-based model for tabular data generation that integrates differential privacy into its training process. An outlier handling technique is also employed to stabilize model performance under noisy training conditions. Experiments conducted across multiple datasets reveal a privacy-utility trade-off in which the optimal noise level depends on dataset-specific characteristics. A further finding is that the order of features in the input sequence significantly influences the quality of the synthetic data produced by the LLM-based model. This study offers a framework that strengthens privacy in synthetic tabular data generation and provides insights into the mechanics of LLM-driven data synthesis.
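The abstract does not say which differentially private training mechanism DP-Tabula uses. As a rough illustration only, the sketch below assumes the common DP-SGD recipe: clip each example's gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping bound, and average. The function names, `max_norm`, and `noise_multiplier` are hypothetical and are not taken from the paper. The clipping step here is part of DP-SGD itself and is not a stand-in for the paper's outlier handling technique.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Assumed DP-SGD-style step; not confirmed as the paper's exact method.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, max_norm)
        total = [t + c for t, c in zip(total, clipped)]
    # The noise standard deviation scales with the clipping bound.
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_example_grads) for n in noisy]


# Toy per-example gradients for a two-parameter model (made-up values).
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
update = dp_average_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

In this kind of scheme, a larger `noise_multiplier` gives a stronger privacy guarantee but a noisier update, which is one way the privacy-utility trade-off described in the abstract can arise.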

Cite This Study

Niu et al. studied this question.

synapsesocial.com/papers/698c1bef267fb587c655decf https://doi.org/10.5167/uzh-291175
