Abstract Recent contrastive vision–language models such as CLIP excel at few-shot learning but are often too large for practical deployment. To enable efficient use, we propose a CLIP-supervised distillation framework that transfers CLIP’s multimodal knowledge into lightweight vision-only networks. Unlike conventional unimodal distillation, our method uses a dual-contrastive loss to align student visual features with CLIP’s image–text embedding space, using text embeddings as semantic anchors to preserve class-level feature structure. Experiments on CIFAR-100 and ImageNet show that our approach improves MobileNet accuracy by 4.83% and outperforms existing distillation baselines, yielding a compact yet semantically aligned model for efficient deployment. Code is available at https://github.com/pandeng-001/CFD-CLIP.
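The abstract does not give the exact form of the dual-contrastive loss, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the loss is the sum of two InfoNCE-style terms: one that matches each student feature to the teacher (CLIP) image embedding of the same sample, and one that matches each student feature to the text embedding of its class, which plays the role of the semantic anchor. The function names, the temperature value of 0.07, and the equal weighting of the two terms are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length; the floor avoids division by zero.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, 1e-12) for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_from_sims(sims, target, temperature):
    # Softmax cross-entropy over temperature-scaled similarities,
    # computed with the log-sum-exp trick for numerical stability.
    logits = [s / temperature for s in sims]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def dual_contrastive_loss(student, teacher_img, text_anchors, labels,
                          temperature=0.07):
    # Hypothetical loss (assumption, not taken from the paper):
    #   image term: student feature i should match teacher image feature i
    #   text term:  student feature i should match the text anchor of labels[i]
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher_img]
    a = [l2_normalize(v) for v in text_anchors]
    n = len(s)
    img_term = 0.0
    txt_term = 0.0
    for i in range(n):
        img_term += cross_entropy_from_sims(
            [dot(s[i], tj) for tj in t], i, temperature)
        txt_term += cross_entropy_from_sims(
            [dot(s[i], ak) for ak in a], labels[i], temperature)
    # Equal weighting of the two terms is an assumption.
    return (img_term + txt_term) / n


# Toy batch: two samples, two classes, 2-D embeddings.
teacher = [[1.0, 0.0], [0.0, 1.0]]
anchors = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
aligned = dual_contrastive_loss(teacher, teacher, anchors, labels)
swapped = dual_contrastive_loss(teacher[::-1], teacher, anchors, labels)
print(aligned < swapped)
```

In this toy setup, a student whose features coincide with the teacher's receives a much smaller loss than one whose features are swapped between samples, which is the behaviour an alignment objective of this kind is meant to reward.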
Mingyong Pang
Weiwei Zhang
Xiao Li
Huaqiao University
Pang et al. (Mon,) studied this question.
www.synapsesocial.com/papers/68d4758931b076d99fa6d258 — DOI: https://doi.org/10.21203/rs.3.rs-7464307/v1