Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
Saumya Gandhi
Visvesvaraya National Institute of Technology
Ritu Gala
Vijay Viswanathan
Northwestern University
Gandhi et al. (Mon,) studied this question.
synapsesocial.com/papers/68e6e2eeb6db64358765ece9 — DOI: https://doi.org/10.48550/arxiv.2404.14361