Semantic mismatch remains a key challenge in conventional knowledge distillation, where representational features are typically regressed from the teacher to the student in a one-to-one spatial matching fashion. In this paper, we address semantic mismatch by examining architectural differences between teacher and student networks. Specifically, due to the variations in network width and depth, the teacher network has a larger receptive field than the student, enabling it to integrate a broader spatial context. In contrast, the student model captures more localized features. This disparity exacerbates semantic misalignment. To alleviate this issue, we propose a novel one-to-all spatial matching knowledge distillation approach, wherein each pixel of the teacher's feature is distilled to all spatial locations of the student's feature map, weighted by a similarity map produced by a Target-aware Transformer (TaT). To further enhance TaT, we reduce its quadratic computational complexity and prevent incorrect spatial alignment, such as distilling background regions from the teacher to foreground regions in the student, and vice versa. In addition, we introduce the "looking broader" strategy, which rearranges the distilled representations of the student and teacher to align their receptive fields. This strategy is motivated by the observation that while individual pixels in student features typically have smaller receptive fields, aggregating multiple pixels can effectively bridge this gap. Therefore, we propose integrating feature pixels from multiple spatial positions using an efficient matrix multiplication. We validate our method through extensive experiments and demonstrate its superior performance and broad generalization capability across various backbone networks and vision tasks, including image classification, semantic segmentation, and object detection.
Lin et al. studied this question.
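To make the one-to-all matching described above concrete, the following is a minimal sketch and not the authors' implementation. It assumes that teacher and student feature maps have been flattened to N spatial positions with the same channel count C, that the similarity map is a softmax over plain dot products (the learned projections inside TaT are omitted), and that the distillation loss is a mean squared error. The function names `similarity_map` and `one_to_all_loss` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def similarity_map(teacher_pixel, student):
    """Weights of one teacher pixel over ALL student positions.

    Assumption: dot-product similarity followed by softmax; the
    abstract only states that TaT produces a similarity map.
    """
    return softmax([dot(teacher_pixel, s) for s in student])


def one_to_all_loss(teacher, student):
    """Mean squared error between each teacher pixel and the
    similarity-weighted mixture of all student pixels.

    teacher, student: lists of N feature vectors, each of length C
    (a flattened H*W feature map). Equal C is an assumption here.
    """
    n, c = len(teacher), len(teacher[0])
    loss = 0.0
    for t in teacher:
        weights = similarity_map(t, student)
        # Aggregate every student position, weighted by similarity.
        mixed = [sum(w * s[k] for w, s in zip(weights, student))
                 for k in range(c)]
        loss += sum((m - tk) ** 2 for m, tk in zip(mixed, t))
    return loss / (n * c)


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.0, 1.0]]
    student = [[0.9, 0.1], [0.2, 0.8]]
    print(one_to_all_loss(teacher, student))
```

The nested loop over teacher pixels and student positions makes the N × N cost of the similarity map explicit, which is the quadratic complexity the abstract says the method reduces. In practice the same computation would be expressed as batched matrix multiplications on tensors.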