Instruction tuning is a key step in adapting large language models (LLMs) to understand and follow human instructions. It enables LLMs to turn general knowledge into task-specific responses that align with user intent. Although many high-quality instruction tuning datasets have been released, using them efficiently during supervised fine-tuning (SFT) remains a challenge, because training on the full corpus is computationally expensive. To address this inefficiency, we explore whether a compact, high-quality subset of instruction data (a coreset) can match the downstream performance of full-dataset SFT, thereby reducing training cost without sacrificing effectiveness. The key idea builds on a decomposition we identify: in instruction tuning, the training loss can be split into two components that quantify the contribution of an instruction to the two fundamental capabilities of LLMs, namely knowledge-related capability and instruction-following capability. We then revisit the objective of classical coreset approaches so that it balances the two capabilities when selecting instruction examples. Based on a bi-level formulation and a composite gradient distance that makes the objective submodular, we design an effective algorithm with a bounded approximation error. Experiments on 4 datasets and 9 downstream tasks demonstrate that our method, BRIEF, reduces computational costs by 3× while improving accuracy by 5% on Llama-3.1-8B, Qwen3-4B and Mistral-Nemo-12B.
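To make the notion of submodular coreset selection over gradient distances more concrete, the following is a minimal sketch and not the BRIEF algorithm itself. It assumes a standard facility-location objective maximized greedily, and a composite distance formed as a weighted sum of Euclidean distances over two per-example gradient feature matrices. The names `g_know`, `g_inst`, `alpha`, `composite_distance` and `greedy_coreset` are hypothetical and do not come from the paper, and the bi-level formulation is not modeled here.

```python
import numpy as np


def composite_distance(g_know, g_inst, alpha=0.5):
    """Pairwise distance mixing two per-example gradient feature matrices.

    g_know, g_inst: arrays of shape (n, d1) and (n, d2). Hypothetical
    stand-ins for knowledge-related and instruction-following gradient
    features. alpha is an assumed mixing weight in [0, 1].
    """
    def pairwise(x):
        diff = x[:, None, :] - x[None, :, :]
        return np.linalg.norm(diff, axis=-1)

    return alpha * pairwise(g_know) + (1.0 - alpha) * pairwise(g_inst)


def greedy_coreset(dist, k):
    """Greedily maximize a facility-location objective over a distance matrix.

    Facility location is monotone submodular, so the greedy choice has the
    classical (1 - 1/e) approximation guarantee for that objective.
    """
    sim = dist.max() - dist           # turn distances into non-negative similarities
    n = sim.shape[0]
    best = np.zeros(n)                # how well each example is already covered
    chosen = np.zeros(n, dtype=bool)  # mask of selected candidates
    selected = []
    for _ in range(k):
        # Marginal gain of candidate j: total coverage improvement over all examples.
        gains = np.maximum(sim - best[:, None], 0.0).sum(axis=0)
        gains[chosen] = -np.inf       # never pick the same example twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best = np.maximum(best, sim[:, j])
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g_know = rng.normal(size=(100, 8))  # synthetic features, for illustration only
    g_inst = rng.normal(size=(100, 8))
    subset = greedy_coreset(composite_distance(g_know, g_inst, alpha=0.5), k=10)
    print(subset)
```

In this sketch, `alpha` controls how strongly the selection favors coverage of one capability over the other. How the actual method weights or combines the two components is not specified in the abstract.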
Chaoyuan Shen
Chi Zhang
Chengliang Chai
Proceedings of the VLDB Endowment
Massachusetts Institute of Technology
University of Arizona
Beijing Institute of Technology
Shen et al. (Sun,) studied this question.
www.synapsesocial.com/papers/6a002087c8f74e3340f9b6e1 — DOI: https://doi.org/10.14778/3797919.3797933