ABSTRACT

Knowledge distillation (KD) is a widely used technique for transferring predictive behavior from a high‐capacity teacher model to a compact student model, providing a scalable strategy to compress and adapt foundation models to downstream tasks while allowing the distillation process to be tailored toward the target application. Its success spans both computer vision and natural language processing domains, where KD enables faster inference and greater accessibility without requiring costly retraining of large models. Despite its empirical prominence, the body of work addressing its theoretical justification remains relatively sparse. In this work, we present a systematic overview of the theoretical foundations of knowledge distillation. Specifically, we examine perspectives that frame KD as smoothing label distributions, regularizing empirical risk, and approximating mutual information, aiming to bridge the gap between practical utility and theoretical insight. We evaluate the impact of each theoretical perspective through image classification experiments on CIFAR‐10, examining how these interpretations manifest in practical distillation outcomes.

This article is categorized under:
Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods
Statistical Learning and Exploratory Methods of the Data Sciences > Neural Networks
Statistical Models > Classification Models
Chuanhui Liu
Hong Yin
Xiao Wang
Wiley Interdisciplinary Reviews Computational Statistics
Purdue University West Lafayette
Liu et al. (Wed,) studied this question.
www.synapsesocial.com/papers/68e9b1d0ba7d64b6fc13292b — DOI: https://doi.org/10.1002/wics.70049