Synthetic data generation has become a powerful approach for producing high-quality, privacy-preserving datasets, especially in domains where the data are sensitive. Large Language Models (LLMs) have proven highly capable at natural language generation and, more recently, have shown promise for generating tabular data. However, existing LLM-based approaches lack built-in privacy guarantees, leaving them susceptible to privacy breaches. To address this limitation, this work proposes DP-Tabula, an LLM-based model for tabular data generation that integrates Differential Privacy into its training process. An outlier handling technique is also employed to stabilize model performance under noisy training conditions. Experiments across multiple datasets reveal a privacy-utility trade-off in which the optimal noise level depends on dataset-specific characteristics. The experiments also show that the order of features in the input sequence significantly influences the quality of the synthetic data produced by the LLM-based model. This study thus offers a framework that strengthens privacy in synthetic tabular data generation and provides insight into the mechanics of LLM-driven data synthesis.
Niu et al. studied this question.
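The abstract does not say which mechanism is used to integrate Differential Privacy into training. As a rough illustration only, the sketch below assumes a DP-SGD-style update, in which each example's gradient is clipped to a fixed L2 norm and Gaussian noise is added to the sum before averaging. The function names, the toy gradients, and the parameters `max_norm` and `noise_multiplier` are hypothetical and are not taken from DP-Tabula.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Assumption: the noise standard deviation is noise_multiplier * max_norm,
    as in the usual DP-SGD formulation. The source does not specify this.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


# Toy per-example gradients (hypothetical values, not from any experiment).
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Under this assumption, a larger `noise_multiplier` gives stronger privacy but a noisier update, which is one way to picture the privacy-utility trade-off the abstract reports. Clipping also bounds the influence of any single extreme record on the update.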