Abstract: Recent contrastive vision–language models such as CLIP excel at few-shot learning but are often too large for practical deployment. To enable efficient use, we propose a CLIP-supervised distillation framework that transfers CLIP's multimodal knowledge into lightweight vision-only networks. Unlike conventional unimodal distillation, our method uses a dual-contrastive loss to align student visual features with CLIP’s image–text embedding space, using text embeddings as semantic anchors to preserve class-level feature structure. Experiments on CIFAR-100 and ImageNet show that our approach improves MobileNet accuracy by 4.83% and outperforms existing distillation baselines, yielding a compact yet semantically aligned model for efficient deployment. Code is available at https://github.com/pandeng-001/CFD-CLIP.
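To make the idea of a dual-contrastive loss concrete, the following is a minimal, framework-free sketch and not the authors' implementation. It assumes the student features have already been projected into the same dimension as CLIP's embedding space, that one text embedding per class is available as an anchor, and that the two terms are combined with a simple weight. The function name `dual_contrastive_loss`, the temperature `tau`, and the weight `lam` are hypothetical choices made for illustration; the exact formulation used in the paper may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def dual_contrastive_loss(student, teacher_img, text_anchors, labels,
                          tau=0.07, lam=1.0):
    """Illustrative two-term contrastive loss (assumed form, not the paper's).

    student      : per-image student embeddings, already projected to CLIP's dimension
    teacher_img  : per-image CLIP image embeddings for the same batch
    text_anchors : one CLIP text embedding per class
    labels       : class index of each image
    """
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher_img]
    a = [l2_normalize(v) for v in text_anchors]
    n = len(s)
    img_term = 0.0
    txt_term = 0.0
    for i in range(n):
        # Term 1: the student embedding of image i should match the teacher
        # embedding of the same image against all other images in the batch.
        img_logits = [dot(s[i], t[j]) / tau for j in range(n)]
        img_term += cross_entropy(img_logits, i)
        # Term 2: the student embedding should be closest to the text anchor
        # of its own class among all class anchors.
        txt_logits = [dot(s[i], a[c]) / tau for c in range(len(a))]
        txt_term += cross_entropy(txt_logits, labels[i])
    return (img_term + lam * txt_term) / n
```

In this sketch the first term plays the role of image-to-image alignment with the teacher, while the second uses the text embeddings as class-level anchors. With cosine similarities and a small temperature, a student whose embeddings coincide with the teacher's and with the correct text anchors receives a loss close to zero, while assigning the wrong anchors increases it sharply.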
Pang et al. (Mon,) studied this question.