June 27, 2024 (Open Access)

Data Generation Using Large Language Models for Text Classification: An Empirical Case Study

Yinheng Li (Microsoft Research, United Kingdom), Rogerio Bonatti (Microsoft, United States), Sara Abdali (University of California, Riverside)

Key points

Synthetic data can improve text classification, but how much it helps depends on prompt selection and task complexity.
The quality of the generated data strongly affects model training outcomes, so generation practices need care.
The study empirically analyzes LLM-generated synthetic data across multiple text classification tasks, identifies the factors that influence data quality, and recommends generation practices that support more effective training of natural language understanding models.

Abstract

Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
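
The evaluation idea in the abstract, training a classifier on synthetic data and scoring it on real data as a proxy for synthetic-data quality, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tiny Naive Bayes classifier stands in for the NLU models (this page does not say which models were used), the function names are hypothetical, and the example sentences are hand-written toy data, not outputs or results from the paper.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):  # sorted for deterministic tie-breaking
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # tokens unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy_on_real(synthetic_train, real_test):
    """Train on synthetic examples only, then measure accuracy on real examples."""
    model = train_naive_bayes(synthetic_train)
    correct = sum(predict(model, text) == label for text, label in real_test)
    return correct / len(real_test)

# Hypothetical toy data: two "generation approaches" with different diversity.
synthetic_diverse = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot awful acting", "neg"),
]
synthetic_repetitive = [
    ("great", "pos"), ("great", "pos"),
    ("terrible", "neg"), ("terrible", "neg"),
]
real_test = [
    ("loved the wonderful plot", "pos"),
    ("awful and boring", "neg"),
]

for name, data in [("diverse", synthetic_diverse), ("repetitive", synthetic_repetitive)]:
    print(name, accuracy_on_real(data, real_test))
```

Holding the real test set fixed while swapping the synthetic training set is what makes the downstream accuracy comparable across generation approaches; in this toy setup, the repetitive set shares no vocabulary with the test sentences, so the classifier falls back to its tie-breaking default.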
