The growing use of Large Language Models (LLMs) has enabled the generation of high-quality synthetic text, offering a potential alternative to sensitive real-world datasets in domains where privacy concerns limit data sharing. However, synthetic text is not inherently privacy-safe: fine-tuning generative models on domain-specific data can improve semantic fidelity while also increasing the risk of memorization and information leakage. In this work, we propose a unified evaluation framework to systematically analyze the relationship between utility and privacy risk in LLM-generated synthetic text. Its key novelty is that it combines semantic utility metrics and practical privacy attacks within a single, controlled pipeline. Unlike prior studies that often assess text quality and privacy risk separately, our framework jointly measures semantic fidelity, distributional alignment, memorization behavior, and membership inference vulnerability under the same controlled protocol, enabling direct analysis of the utility–privacy trade-off in synthetic text generation. We empirically evaluate the framework using GPT-2 fine-tuned on two datasets: AG News as a general-domain benchmark and PubMed abstracts as a biomedical-domain validation dataset. Results show that fine-tuning improves semantic utility but also increases empirical privacy risk. On AG News, BERTScore increases to 0.81, while membership inference ROC-AUC rises from 0.45 to 0.64. The PubMed experiment shows the same directional trend, with improved semantic fidelity accompanied by higher canary memorization and membership inference vulnerability. Canary exposure analysis further indicates clear memorization of rare sequences after fine-tuning.
These findings demonstrate a measurable trade-off between utility and privacy in synthetic text generation and highlight the importance of jointly evaluating both dimensions. The proposed framework provides a reproducible methodology for assessing the privacy risks of high-quality synthetic text and supports more responsible deployment of LLM-based synthetic data systems.
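The two privacy measurements named above, membership inference ROC-AUC and canary exposure, can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a simple loss-based attack (lower per-example loss suggests training-set membership) and a rank-based exposure score, and the function names and loss values are hypothetical placeholders, not results from the paper.

```python
import math


def roc_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as 0.5, which matches the usual ROC-AUC definition.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def canary_exposure(canary_loss, reference_losses):
    """Rank-based exposure: log2(number of candidates) - log2(rank of the canary).

    The canary is ranked by loss among random reference sequences;
    rank 1 means the model prefers the canary over every reference.
    """
    rank = 1 + sum(1 for loss in reference_losses if loss < canary_loss)
    total = len(reference_losses) + 1
    return math.log2(total) - math.log2(rank)


# Hypothetical per-example losses (illustrative values only).
member_losses = [1.9, 2.1, 2.4, 3.0]      # examples seen during fine-tuning
nonmember_losses = [2.2, 2.8, 3.1, 3.5]   # held-out examples

# Assumption: lower loss suggests membership, so the attack score is the negated loss.
auc = roc_auc([-l for l in member_losses], [-l for l in nonmember_losses])
print(f"membership inference ROC-AUC: {auc:.4f}")

# Hypothetical canary loss compared against three random reference sequences.
print(f"canary exposure: {canary_exposure(0.5, [1.0, 2.0, 3.0]):.2f} bits")
```

An AUC near 0.5 means the attack cannot distinguish training examples from held-out ones, while higher values indicate greater leakage. Under this sketch, a larger exposure means the canary ranks closer to the top of the candidate list.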
Lubana Isaoglu
Istanbul University-Cerrahpaşa
Zeynep Orman
Istanbul University-Cerrahpaşa
Engineering Perspective
Istanbul University-Cerrahpaşa
Isaoglu et al. studied this question.
synapsesocial.com/papers/6a265ca8ad53cfb9357c5e19 — DOI: https://doi.org/10.64808/engineeringperspective.1910777