Multi-teacher knowledge distillation transfers knowledge from multiple large teacher models to a small student model and has performed well on many downstream tasks. However, training and running inference with multiple teacher models is costly in both time and storage. We present MoE-KD, a simple but effective framework that produces supervision for training the student model from a single teacher model, which addresses these problems while improving effectiveness. In MoE-KD, multiple trainable prompts extract different views of each sample from a single pre-trained language model, so only a few parameters (the prompts) need to be trained and stored. To make the generated supervision signals more robust and accurate, we introduce an uncertainty-based mechanism and a selector module that routes each input instance to its corresponding teacher. We also extend MoE-KD to lifelong learning scenarios, proposing a lightweight solution to catastrophic forgetting. We conduct experiments in both traditional KD and lifelong learning scenarios. Compared with strong baseline methods, MoE-KD improves accuracy by up to 1.1% and efficiency by up to 140% in knowledge distillation, and yields an average improvement of 2.8% in lifelong learning.
Meng et al. studied this question.