April 22, 2024 · Open Access

Better Synthetic Data by Retrieving and Transforming Existing Datasets

Abstract

Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.

Cite this study

Gandhi et al. (2024) studied this question.

synapsesocial.com/papers/68e6e2eeb6db64358765ece9 — DOI: https://doi.org/10.48550/arxiv.2404.14361

Also consider

Synapse has enriched 5 closely related papers on similar topics. Consider them for comparative context:

Improving Training Dataset Balance with ChatGPT Prompt Engineering · 2024 · 18 citations
Best Practices and Lessons Learned on Synthetic Data · 2024 · 15 citations
Curating Grounded Synthetic Data with Global Perspectives for Equitable AI · 2024 · 2 citations
Prompting-based Synthetic Data Generation for Few-Shot Question Answering · 2024
Synthetic Data Generation for Supervised Fine-Tuning: A Comprehensive Survey · 2026

Authors

Saumya Gandhi

Ritu Gala

Vijay Viswanathan

Northwestern University
