The problem of optimizing large language models (LLMs) such as ChatGPT is discussed. One direction under active development is knowledge distillation: the transfer of knowledge from a large teacher model to a smaller student model without significant loss of accuracy. Existing knowledge distillation methods have several disadvantages: inaccurate knowledge transfer, a long training process, and error accumulation in long sequences. A combination of two methods that helps improve the quality of knowledge distillation is considered: selective teacher intervention in the student’s training process and low-rank adaptation. The proposed combination can be applied to problems with limited computational resources.
Sikarev et al. studied this question.
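The two ingredients named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The text does not say how the teacher decides when to intervene, so the rule used here (apply the distillation term only at token positions where the student's softened distribution diverges from the teacher's by more than a threshold) is an assumption, as are the function names, the temperature, and the threshold value. The low-rank adaptation part follows the standard formulation: a frozen weight matrix plus a trainable product of two small matrices.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature, numerically stabilized."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_per_token(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) at each token position, on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)


def selective_distillation_loss(teacher_logits, student_logits,
                                threshold=0.1, temperature=2.0):
    """Average the KL term only over positions where it exceeds `threshold`.

    ASSUMPTION: this thresholding rule is a hypothetical stand-in for
    "selective teacher intervention"; the source does not define the rule.
    Returns (loss, boolean mask of positions where the teacher intervened).
    """
    kl = kl_per_token(teacher_logits, student_logits, temperature)
    mask = kl > threshold
    if not mask.any():
        return 0.0, mask
    return float(kl[mask].mean()), mask


def lora_forward(x, W, A, B, alpha=8.0):
    """Linear layer with a low-rank update: x @ (W + (alpha / r) * B @ A).T

    W: frozen weights, shape (d_out, d_in)
    A: trainable, shape (r, d_in); B: trainable, shape (d_out, r)
    """
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T


# Toy usage: 3 token positions, vocabulary of 4; the teacher disagrees at position 1.
student_logits = np.zeros((3, 4))
teacher_logits = student_logits.copy()
teacher_logits[1, 0] = 8.0
loss, intervened = selective_distillation_loss(teacher_logits, student_logits)
print("loss:", loss, "intervened at:", np.flatnonzero(intervened))
```

In this sketch only `A` and `B` would be updated during training, so the number of trainable parameters per layer is r * (d_in + d_out) rather than d_in * d_out, which is what makes the combination attractive when computational resources are limited.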