What question did this study set out to answer?

This research aims to improve knowledge distillation by using a single teacher model while enhancing training efficiency and robustness.

January 26, 2026

Orchestrating Prompt Expertise: Enhancing Knowledge Distillation via Expert-Guided Tuning

Key points

Developed the MoE-KD framework for knowledge distillation using multiple trainable prompts from a single teacher model.
Introduced an uncertainty-based mechanism for generating robust supervision signals.
Implemented a selector module that routes each input instance to its corresponding prompt-based teacher.
Extended the framework for lifelong learning scenarios to address catastrophic forgetting.
Achieved improvements of up to 1.1% in accuracy for knowledge distillation tasks.
Increased efficiency in training and inference by up to 140% compared to baseline methods.
Obtained average improvements of 2.8% in lifelong learning scenarios.
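
The uncertainty-based supervision named in the key points can be illustrated with a minimal sketch. The summary does not say how uncertainty is measured or how the prompt-based teachers are combined, so this example assumes predictive entropy as the uncertainty measure and a softmax over negative entropies as the weighting rule. The function names and the logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; used here as an assumed uncertainty measure."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty_weighted_targets(expert_logits):
    """Blend per-prompt class distributions, down-weighting uncertain prompts.

    Assumption: weight = softmax(-entropy). The paper may use a different rule.
    """
    expert_probs = [softmax(row) for row in expert_logits]
    weights = softmax([-entropy(p) for p in expert_probs])
    num_classes = len(expert_probs[0])
    soft_targets = [
        sum(w * p[c] for w, p in zip(weights, expert_probs))
        for c in range(num_classes)
    ]
    return soft_targets, weights


# Hypothetical logits from three prompts applied to the same frozen teacher.
expert_logits = [
    [4.0, 0.5, 0.1],  # confident prediction for class 0
    [1.0, 0.9, 0.8],  # nearly uniform, so highly uncertain
    [3.0, 1.0, 0.2],  # moderately confident prediction for class 0
]
soft_targets, weights = uncertainty_weighted_targets(expert_logits)
```

In this sketch the nearly uniform prompt receives the smallest weight, so it contributes least to the soft targets passed to the student.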

Abstract

Multi-teacher knowledge distillation transfers knowledge from multiple large teacher models to a small student model and has performed well on many downstream tasks. However, training and running inference with multiple teacher models is time-consuming and storage-intensive. We present MoE-KD, a simple but effective framework that produces supervision for training the student model from a single teacher model, which addresses these problems and improves effectiveness. In MoE-KD, multiple trainable prompts extract different views of samples from a single pre-trained language model, so only a few parameters (the prompts) need to be trained and stored. To make the generated supervision signals more robust and accurate, we introduce an uncertainty-based mechanism and a selector module that routes each input instance to its corresponding teacher. We also extend MoE-KD to lifelong learning scenarios, proposing a lightweight solution for catastrophic forgetting. We conduct experiments on traditional KD scenarios and lifelong learning scenarios. Compared with strong baseline methods, MoE-KD yields improvements of up to 1.1% in accuracy and 140% in efficiency in knowledge distillation, and an average improvement of 2.8% in lifelong learning.
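
The routing and distillation steps described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions: the selector is reduced to an argmax over per-prompt scores, and the student is trained with the standard temperature-softened KL distillation loss. The abstract does not confirm either choice, and all names and numbers below are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def route(selector_scores):
    """Pick the prompt-based teacher with the highest selector score.

    Assumption: hard top-1 routing. The paper's selector may be soft or learned
    differently.
    """
    return max(range(len(selector_scores)), key=lambda i: selector_scores[i])


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0.0)


# Hypothetical selector scores and logits for one input instance.
selector_scores = [0.2, 1.5, -0.3]
teacher_logits_per_prompt = [
    [2.0, 1.0, 0.1],
    [0.3, 3.2, 0.5],
    [1.0, 1.0, 1.0],
]
student_logits = [0.5, 2.0, 0.4]

chosen = route(selector_scores)
loss = kd_loss(teacher_logits_per_prompt[chosen], student_logits)
```

Because every prompt shares the same frozen language model, adding another teacher view in this setup only adds the parameters of one more prompt, which matches the storage argument made in the abstract.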
