April 22, 2024 | Open Access

Better Synthetic Data by Retrieving and Transforming Existing Datasets

Abstract

Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
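The retrieve-then-transform idea described in the abstract can be illustrated with a toy sketch. This is not the DataTune implementation and does not use the prompt2model API; every name here (`retrieve`, `transform`, `to_sentiment`, the toy corpus) is hypothetical, the retriever is a simple token-overlap ranking, and the per-row rewrite is a hand-written rule standing in for a language-model call.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def tokenize(text: str) -> set:
    """Lowercase whitespace tokenization; a stand-in for a real retriever's encoder."""
    return set(text.lower().split())


def retrieve(task: str, corpus: Dict[str, str], k: int = 1) -> List[str]:
    """Rank dataset names by Jaccard overlap between task and dataset descriptions."""
    task_tokens = tokenize(task)

    def score(name: str) -> float:
        desc_tokens = tokenize(corpus[name])
        union = task_tokens | desc_tokens
        return len(task_tokens & desc_tokens) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)[:k]


def transform(rows: List[Row], rewrite: Callable[[Row], Row]) -> List[Row]:
    """Apply a per-row rewrite (an LLM call in practice) to fit the target format."""
    return [rewrite(row) for row in rows]


# Toy data, invented for illustration only.
corpus = {
    "movie_reviews": "english movie reviews with star ratings",
    "weather_logs": "hourly temperature and humidity readings",
}
datasets = {
    "movie_reviews": [{"text": "Great film", "stars": "5"}],
    "weather_logs": [{"temp": "21", "humidity": "40"}],
}


def to_sentiment(row: Row) -> Row:
    # Hypothetical rule-based rewrite standing in for an LLM transformation.
    label = "positive" if int(row["stars"]) >= 4 else "negative"
    return {"input": row["text"], "output": label}


task = "classify the sentiment of english movie reviews"
best = retrieve(task, corpus, k=1)[0]
examples = transform(datasets[best], to_sentiment)
```

The point of the sketch is the two-stage shape: first select an existing dataset that is close to the target task, then rewrite each of its rows into the input/output format the target task expects, rather than generating examples from scratch.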

Authors

Saumya Gandhi

Visvesvaraya National Institute of Technology

Ritu Gala

Vijay Viswanathan

Northwestern University
