What question did this study set out to answer?

We aim to address semantic mismatch in knowledge distillation by aligning the receptive fields of teacher and student networks.

June 20, 2026

Looking Broader for Knowledge Distillation Via Receptive-Field Alignment

Key Points

Addressed semantic mismatch in knowledge distillation by aligning the receptive fields of teacher and student networks.
Proposed a one-to-all spatial matching approach for knowledge distillation.
Utilized a Target-aware Transformer to produce a similarity map for distilling teacher features to student features.
Implemented an efficient matrix multiplication to integrate feature pixels from multiple spatial positions.
Demonstrated superior performance in image classification, semantic segmentation, and object detection.
Showed improved alignment of receptive fields between teacher and student networks, reducing the semantic mismatch issue.
Validated broad generalization capability across various backbone networks.

Abstract

Semantic mismatch remains a key challenge in conventional knowledge distillation, where representational features are typically regressed from the teacher to the student in a one-to-one spatial matching fashion. In this paper, we address semantic mismatch by examining architectural differences between teacher and student networks. Specifically, due to the variations in network width and depth, the teacher network has a larger receptive field than the student, enabling it to integrate a broader spatial context. In contrast, the student model captures more localized features. This disparity exacerbates semantic misalignment. To alleviate this issue, we propose a novel one-to-all spatial matching knowledge distillation approach, wherein each pixel of the teacher's feature is distilled to all spatial locations of the student's feature map, weighted by a similarity map produced by a Target-aware Transformer (TaT). To further enhance TaT, we reduce its quadratic computational complexity and prevent incorrect spatial alignment, such as distilling background regions from the teacher to foreground regions in the student, and vice versa. In addition, we introduce the "looking broader" strategy, which rearranges the distilled representations of the student and teacher to align their receptive fields. This strategy is motivated by the observation that while individual pixels in student features typically have smaller receptive fields, aggregating multiple pixels can effectively bridge this gap. Therefore, we propose integrating feature pixels from multiple spatial positions using an efficient matrix multiplication. We validate our method through extensive experiments and demonstrate its superior performance and broad generalization capability across various backbone networks and vision tasks, including image classification, semantic segmentation, and object detection.
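
The two mechanisms described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not specify how TaT computes its similarity map or which matrix performs the aggregation, so the sketch assumes a plain dot-product softmax with no learned projections and a matrix that averages non-overlapping k x k patches. All function names (`one_to_all_loss`, `patch_average_matrix`, and the helpers) are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def one_to_all_loss(teacher, student):
    """Match every teacher pixel against all student pixels.

    teacher, student: N x C lists (N = H * W flattened spatial positions).
    Returns (similarity map of shape N x N, mean squared error).
    """
    # Assumption: similarity is a softmax over student positions of
    # teacher-student dot products (the real TaT is a learned module).
    sim = softmax_rows(matmul(teacher, transpose(student)))
    # Each row of `mixed` is a similarity-weighted blend of all student pixels.
    mixed = matmul(sim, student)
    sq = [(m - t) ** 2 for mr, tr in zip(mixed, teacher) for m, t in zip(mr, tr)]
    return sim, sum(sq) / len(sq)


def patch_average_matrix(height, width, k):
    """Build an (H/k * W/k) x (H * W) matrix that averages k x k patches."""
    rows = []
    for pi in range(height // k):
        for pj in range(width // k):
            row = [0.0] * (height * width)
            for di in range(k):
                for dj in range(k):
                    row[(pi * k + di) * width + (pj * k + dj)] = 1.0 / (k * k)
            rows.append(row)
    return rows


# Toy 2 x 2 feature maps with 2 channels, flattened to 4 pixels each.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
student = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.9], [0.4, 0.6]]

sim, loss = one_to_all_loss(teacher, student)
pool = patch_average_matrix(2, 2, 2)   # a single row averaging all 4 pixels
broad_student = matmul(pool, student)  # 1 x 2 aggregated student feature
```

In this toy setup, each row of `sim` is a distribution over all student positions, which is what distinguishes one-to-all matching from a one-to-one regression at the same coordinate. Multiplying by `pool` merges several student pixels into one, so the resulting feature covers a wider input region than any single student pixel.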
