The increasing use of synthetic data generated by Large Language Models (LLMs) presents both opportunities and challenges in data-driven applications. While synthetic data offers a cost-effective, scalable alternative to real-world data for model training, its diversity and privacy risks remain underexplored. Focusing on text-based synthetic data, we propose a comprehensive set of metrics to quantitatively assess the diversity (i.e., linguistic expression, sentiment, and user perspective) and privacy (i.e., re-identification risk and stylistic outliers) of synthetic datasets generated by several state-of-the-art LLMs. Experimental results reveal significant limitations in LLMs' ability to generate diverse and privacy-preserving synthetic data. Guided by the evaluation results, we propose a prompt-based approach to enhance the diversity of synthetic reviews while preserving reviewer privacy.
Tevin Atwal
Chan Nam Tieu
Yefeng Yuan
Nanchang University
Atwal et al. studied this question.
synapsesocial.com/papers/68f19f20de32064e504ddf4d — DOI: https://doi.org/10.48550/arxiv.2507.18055