What type of study is this?

This is a quantitative study.

September 23, 2025

Open Access

CFD-CLIP: Contrastive Feature Distillation with CLIP for Image Classification

Key Points

Distilling CLIP into a lightweight vision-only network improves the network's accuracy while keeping its features aligned with CLIP's semantic space.
Experiments on CIFAR-100 and ImageNet show a 4.83% accuracy improvement for MobileNet.
The method employs a novel dual-contrastive loss to align features with CLIP’s embeddings.
Knowledge transfer from CLIP enables efficient usage in practical image classification applications.

Abstract

Recent contrastive vision–language models (CLIP) excel at few-shot learning but are often too large for practical deployment. To enable efficient usage, we propose a CLIP-supervised distillation framework that transfers its multimodal knowledge into lightweight vision-only networks. Unlike conventional unimodal distillation, our method uses a dual-contrastive loss to align student visual features with CLIP’s image–text embedding space, leveraging text embeddings as semantic anchors to preserve class-level feature structure. Experiments on CIFAR-100 and ImageNet show that our approach improves MobileNet accuracy by 4.83% and outperforms existing distillation baselines, providing a compact yet semantically aligned model for efficient deployment. Code is available at https://github.com/pandeng-001/CFD-CLIP.
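The abstract does not give the exact form of the dual-contrastive loss, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the loss has two InfoNCE-style terms: one pulling each student feature toward the CLIP image embedding of the same sample, and one pulling it toward the CLIP text embedding of its class (the "semantic anchor"). The function names, the `weight` and `temperature` parameters, and the assumption that student features are already projected to CLIP's embedding dimension are all hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, targets, temperature=0.07):
    # Mean cross-entropy of softmax(cosine(q, k) / temperature),
    # where targets[i] is the index of the positive key for queries[i].
    queries = [l2_normalize(q) for q in queries]
    keys = [l2_normalize(k) for k in keys]
    total = 0.0
    for q, t in zip(queries, targets):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[t]
    return total / len(queries)

def dual_contrastive_loss(student, clip_image, clip_text, labels,
                          weight=0.5, temperature=0.07):
    # Assumed image term: sample i in the batch should match CLIP image embedding i.
    image_term = info_nce(student, clip_image,
                          list(range(len(student))), temperature)
    # Assumed text term: sample i should match the text embedding of class labels[i].
    text_term = info_nce(student, clip_text, labels, temperature)
    return weight * image_term + (1.0 - weight) * text_term

# Toy usage with 2-D embeddings: two samples, two classes.
clip_image = [[1.0, 0.0], [0.0, 1.0]]
clip_text = [[1.0, 0.1], [0.1, 1.0]]
labels = [0, 1]
aligned = dual_contrastive_loss(clip_image, clip_image, clip_text, labels)
swapped = dual_contrastive_loss(clip_image[::-1], clip_image, clip_text, labels)
print(aligned < swapped)
```

In this toy setup, a student whose features coincide with the teacher's image embeddings gets a lower loss than one whose features are swapped between samples, which is the behavior a contrastive alignment objective is meant to reward. In practice such a term would typically be combined with a standard classification loss, but the abstract does not say how the authors weight the components.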
