What does this research mean for the field?

CRAFT generates high-quality synthetic datasets from a handful of user-written examples. Models trained on these datasets outperform or match general LLMs on question-answering tasks and exceed models trained on human-curated summarization data by 46 preference points. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

The aim is to develop an efficient method for generating synthetic datasets from minimal user input.

March 13, 2026 · Open Access

CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Key Points

The study aims to develop an efficient method for generating task-specific synthetic datasets from minimal user input.
The authors developed the Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT) method.
CRAFT uses similarity-based document retrieval over large-scale web-crawled corpora.
Instruction-tuned LLMs augment the retrieved documents into task samples.
CRAFT-based models outperformed or matched general LLMs on QA tasks.
On summarization, CRAFT-based models exceeded models trained on human-curated data by 46 preference points.
CRAFT remained robust when the quality of the initial few-shots varied.

Abstract

Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given these examples, CRAFT uses large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for finetuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA), as well as summarization. Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while exceeding models trained on human-curated summarization data by 46 preference points. CRAFT outperforms other synthetic dataset generation methods such as Self- and Evol-Instruct, and remains robust even when the quality of the initial few-shots varies.
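The retrieve-then-augment flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a bag-of-words cosine similarity stands in for whatever similarity measure CRAFT actually uses, the corpus is a three-sentence toy list rather than a web-crawled corpus, and the function names (`retrieve`, `build_augmentation_prompt`) and the prompt wording are hypothetical. The prompt is only assembled here; sending it to an instruction-tuned LLM is left out.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase word counts; a toy stand-in for a learned text embedding (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(few_shots, corpus, top_k=2):
    """Return the top_k corpus documents most similar to any of the few-shots."""
    shot_vecs = [bag_of_words(shot) for shot in few_shots]
    scored = []
    for doc in corpus:
        doc_vec = bag_of_words(doc)
        # Score each document by its best match against the few-shots.
        score = max((cosine(doc_vec, sv) for sv in shot_vecs), default=0.0)
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_augmentation_prompt(few_shots, document):
    """Assemble a hypothetical prompt asking an LLM to turn a document into a task sample."""
    examples = "\n\n".join(few_shots)
    return (
        "Here are examples of the task format:\n\n"
        f"{examples}\n\n"
        "Rewrite the following document as one new sample in the same format:\n\n"
        f"{document}"
    )


# Toy data, invented for illustration only.
few_shots = [
    "Question: Which organelle produces ATP in the cell? Answer: The mitochondria."
]
corpus = [
    "The mitochondria is the organelle that produces ATP in the cell.",
    "The stock market closed higher on Friday.",
    "Ribosomes in the cell synthesize proteins from amino acids.",
]

retrieved = retrieve(few_shots, corpus, top_k=2)
prompts = [build_augmentation_prompt(few_shots, doc) for doc in retrieved]
```

In this toy run the two biology sentences rank above the unrelated finance sentence, so only those would be passed on to the augmentation step. A realistic setup would swap the bag-of-words vectors for dense embeddings and an approximate nearest-neighbour index, since scoring every document in a web-scale corpus this way would be far too slow.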

Authors

Ingo Ziegler

University of Copenhagen

Abdullatif Koksal

Desmond Elliott
