What question did this study set out to answer?

The aim is to determine whether a compact subset of instruction data can achieve performance comparable to full-dataset supervised fine-tuning.

May 10, 2026

BRIEF: Bi-Level Coreset Selection for Efficient Instruction Tuning in LLMs

Key Points

Aimed to determine whether a compact subset of instruction data can achieve performance comparable to full-dataset supervised fine-tuning.
Proposed a bi-level formulation based on a decomposition of the training loss into two components.
Designed an algorithm that uses a composite gradient distance to select a high-quality coreset.
Conducted experiments across 4 datasets and 9 downstream tasks.
Achieved a 3× reduction in computational costs.
Improved accuracy by 5% on Llama-3.1-8B, Qwen3-4B, and Mistral-Nemo-12B.

Abstract

Instruction tuning is a key step in adapting large language models (LLMs) to understand and follow human instructions. It enables LLMs to turn general knowledge into task-specific responses that align with user intent. Although many high-quality instruction tuning datasets have been released, using them efficiently during supervised fine-tuning (SFT) matters, because training on the full corpus can be computationally expensive. To address this inefficiency, we explore whether a compact, high-quality subset of instruction data can achieve performance comparable to full-dataset SFT, thereby reducing training cost without sacrificing effectiveness. To this end, we propose selecting such a subset (a coreset) of instruction examples that maintains comparable downstream performance while improving training efficiency. The key idea builds on our observation that, in instruction tuning, the training loss can be decomposed into two components that quantify the contribution of an instruction to two fundamental capabilities of LLMs: knowledge-related capability and instruction-following capability. We then revisit the objective of classical coreset approaches so that it balances the two capabilities when selecting instruction examples. Based on a bi-level formulation and a composite gradient distance that makes the objective submodular, we design an effective algorithm that achieves a bounded approximation error. Experiments on 4 datasets across 9 downstream tasks demonstrate that BRIEF reduces computational costs by 3× while improving accuracy by 5% on Llama-3.1-8B, Qwen3-4B, and Mistral-Nemo-12B.
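
To make the idea of selecting a coreset under a submodular objective more concrete, the following is a minimal sketch and not the BRIEF algorithm itself. It assumes each example comes with two hypothetical gradient feature vectors (one per loss component), mixes their pairwise distances with an assumed weight `lam`, and runs plain greedy facility-location selection. The function names, the mixing rule, and the feature arrays are all illustrative assumptions; the bi-level formulation described in the abstract is not modeled here.

```python
import numpy as np


def composite_distance(g_know, g_instr, lam=0.5):
    """Pairwise composite distance between examples (illustrative only).

    g_know, g_instr: (n, d) arrays of hypothetical per-example gradient
    features for the two loss components. lam is an assumed mixing weight.
    """
    def pairwise(x):
        diff = x[:, None, :] - x[None, :, :]
        return np.linalg.norm(diff, axis=-1)

    return lam * pairwise(g_know) + (1.0 - lam) * pairwise(g_instr)


def greedy_coreset(dist, k):
    """Greedy facility-location selection of k example indices.

    Maximizes F(S) = sum_i max_{j in S} sim[i, j], where
    sim = max(dist) - dist. F is monotone submodular, so the greedy
    choice is within a factor (1 - 1/e) of the best size-k subset.
    """
    n = dist.shape[0]
    sim = dist.max() - dist              # non-negative similarities
    best = np.zeros(n)                   # how well each example is covered
    taken = np.zeros(n, dtype=bool)      # examples already selected
    selected = []
    for _ in range(k):
        # Marginal gain in F from adding each candidate column j.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[taken] = -np.inf           # never pick the same example twice
        j = int(np.argmax(gains))
        selected.append(j)
        taken[j] = True
        best = np.maximum(best, sim[:, j])
    return selected
```

With two well-separated clusters of gradient features and `k = 2`, this sketch picks one representative from each cluster, which is the behavior a coverage-style coreset objective is meant to produce.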
