Large language models (LLMs) now routinely contain hundreds of billions of parameters, making them prohibitively expensive to run in latency- or resource-constrained settings. Knowledge distillation offers a principled way to compress such models, yet prevailing approaches train a single, general-purpose student and therefore fail to exploit the rich, task-specific behaviours latent in the teacher. We propose a three-stage framework that (i) clusters teacher responses to uncover coherent behavioural modes, (ii) trains a lightweight student on each cluster by token-level imitation, and (iii) reinforces each student with a self-refinement loop guided by task-aligned rewards. Using GPT-4 as the teacher and Flan-T5-Small or LLaMA2-7B as the base students, our method produces task-specific experts that equal or surpass a distilled generalist while reducing inference cost by an order of magnitude. The framework thus bridges the gap between the versatility of large models and the practical demands of specialised, deployable systems.
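Stage (i) of the framework, grouping teacher responses into behavioural modes and routing each group to its own student, can be illustrated with a minimal sketch. The abstract does not say which embedding or clustering algorithm is used, so everything here is an assumption: a normalised bag-of-words vector stands in for a real sentence embedding, plain k-means stands in for the clustering step, and the names `embed`, `kmeans`, and `build_cluster_datasets` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def embed(text, vocab):
    """Assumed stand-in for a sentence embedding: L2-normalised bag of words."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the point farthest from its nearest existing centroid.
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(far)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def build_cluster_datasets(pairs, k):
    """Group (prompt, teacher_response) pairs by the cluster of the response.

    Each returned list would serve as the imitation set for one student.
    """
    vocab = sorted({w for _, resp in pairs for w in resp.lower().split()})
    vectors = [embed(resp, vocab) for _, resp in pairs]
    labels = kmeans(vectors, k)
    datasets = defaultdict(list)
    for pair, label in zip(pairs, labels):
        datasets[label].append(pair)
    return dict(datasets)


# Toy teacher outputs (invented for illustration, not taken from the paper).
pairs = [
    ("Translate 'cat' to French", "the french translation is chat"),
    ("Translate 'dog' to French", "the french translation is chien"),
    ("What is 2+3?", "the sum equals 5"),
    ("What is 4+4?", "the sum equals 8"),
]
datasets = build_cluster_datasets(pairs, k=2)
```

In this toy setting the translation responses and the arithmetic responses land in separate groups, and each group would then feed the token-level imitation of stage (ii). A real pipeline would presumably swap in learned embeddings and choose the number of clusters from the data.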
Zhangqi Liu
Applied and Computational Engineering
John Brown University
Zhangqi Liu (Tue,) studied this question.
www.synapsesocial.com/papers/68af5f1ead7bf08b1eae2450 — DOI: https://doi.org/10.54254/2755-2721/2025.26304