Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
Yinheng Li
Microsoft Research (United Kingdom)
Rogerio Bonatti
Microsoft (United States)
Sara Abdali
University of California, Riverside
Li et al. studied this question.
synapsesocial.com/papers/68e6300ab6db6435875c21bc — DOI: https://doi.org/10.48550/arxiv.2407.12813