Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets from a small number of user-written few-shots that demonstrate the task to be performed. Given these examples, CRAFT uses large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which can then be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA), as well as summarization. Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while exceeding models trained on human-curated summarization data by 46 preference points. CRAFT outperforms other synthetic dataset generation methods such as Self- and Evol-Instruct, and remains robust even when the quality of the initial few-shots varies.
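The retrieve-then-augment flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the names `retrieve` and `build_augmentation_prompt`, the `top_k` parameter, and the prompt wording are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(few_shots, corpus, top_k=2):
    """Rank corpus documents by their best similarity to any few-shot example."""
    shot_vecs = [embed(shot) for shot in few_shots]
    scored = []
    for doc in corpus:
        doc_vec = embed(doc)
        best = max(cosine(doc_vec, shot_vec) for shot_vec in shot_vecs)
        scored.append((best, doc))
    # Highest similarity first; Python's sort is stable, so ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_augmentation_prompt(few_shots, document):
    """Hypothetical prompt asking an instruction-tuned LLM to rewrite a
    retrieved document into a task sample formatted like the few-shots."""
    examples = "\n\n".join(few_shots)
    return (
        "Here are examples of the target task format:\n\n"
        f"{examples}\n\n"
        "Using only the text below, write one new sample in the same format.\n\n"
        f"Text: {document}"
    )
```

In a real pipeline, each prompt would be sent to an instruction-tuned LLM and the generated samples collected into the fine-tuning dataset; that call is left out here because it depends on the model and serving stack in use.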
Ingo Ziegler
University of Copenhagen
Abdullatif Koksal
Desmond Elliott
Ziegler et al. studied this question.
synapsesocial.com/papers/69b3aaa802a1e69014ccb63a — DOI: https://doi.org/10.1162/tacl.a.56