What question did this study set out to answer?

The aim is to improve fine-grained image classification under few-shot learning by integrating CLIP embeddings and a Siamese network.

March 28, 2026 | Open Access

Cross-Modal Few-Shot Learning via Siamese Similarity Networks on CLIP Embeddings for Fine-Grained Image Classification

Key Points

Developed a cross-modal framework using CLIP embeddings and a Siamese similarity network.
Formed hybrid similarity pairs combining images and textual class descriptions.
Employed a dual-branch CLIP encoder and a contrastive loss function to enhance learning efficiency.
Conducted experiments on benchmark datasets including miniImageNet and CUB-200-2011.
Achieved 70.32% accuracy and 71.15% F1-score on 5-way 1-shot tasks (CUB-200-2011).
Attained 78.41% accuracy and 79.02% F1-score on 5-way 5-shot tasks (CUB-200-2011).
Demonstrated substantial improvements over zero-shot and few-shot baselines, with results averaged over 600 episodes and reported with 95% confidence intervals.
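
To make the hybrid-pair idea in the points above concrete, here is a minimal sketch of a pairwise contrastive loss applied to both image-to-image and image-to-text pairs. The summary does not state the exact loss formulation, so this assumes the classic margin-based pairwise form on L2-normalized embeddings. The function names, the margin value, and the 3-dimensional toy vectors (stand-ins for real CLIP embeddings) are all illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so distances depend only on direction."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(anchor, other, same_class, margin=1.0):
    """Margin-based pairwise contrastive loss (assumed form, not from the paper).

    Same-class pairs are pulled together (loss = d^2); different-class pairs
    are pushed apart until they are at least `margin` away.
    """
    d = euclidean_distance(l2_normalize(anchor), l2_normalize(other))
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy stand-ins for CLIP embeddings (hypothetical values).
image_a = [1.0, 0.0, 0.0]
image_b = [0.9, 0.1, 0.0]      # another image of the same class
text_same = [1.0, 0.05, 0.0]   # text description of the same class
text_other = [0.0, 1.0, 0.0]   # text description of a different class

# Hybrid pairs: (anchor, other, same_class) mixing both modalities.
pairs = [
    (image_a, image_b, True),      # image-to-image, positive
    (image_a, text_same, True),    # image-to-text, positive
    (image_a, text_other, False),  # image-to-text, negative
]

batch_loss = sum(contrastive_loss(a, b, s) for a, b, s in pairs) / len(pairs)
print(f"mean contrastive loss over hybrid pairs: {batch_loss:.6f}")
```

Because both modalities live in the same embedding space, the same loss function handles image and text partners without any special casing, which is the practical appeal of building on a shared latent space.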

Abstract

Fine-grained image classification under few-shot learning conditions remains a significant challenge due to limited labeled data and high inter-class similarity. This paper proposes a novel cross-modal framework that integrates Contrastive Language-Image Pretraining (CLIP) embeddings within a Siamese similarity network to enable robust and label-efficient classification. By leveraging the semantic alignment between textual class descriptions and visual representations, the model forms hybrid similarity pairs (image-to-image and image-to-text) within a shared latent space, facilitating discriminative learning under low-shot scenarios. The architecture employs a dual-branch CLIP encoder and a contrastive loss function to optimize intra-class compactness and inter-class separability. Experiments on benchmark datasets including miniImageNet and CUB-200-2011 demonstrate substantial improvements over zero-shot and few-shot baselines. On CUB-200-2011, the model achieves 70.32% accuracy, 71.15% F1-score, and 68.47% mAP on 5-way 1-shot tasks and 78.41% accuracy, 79.02% F1-score, and 76.83% mAP on 5-way 5-shot tasks (averaged over 600 episodes, with 95% confidence intervals). Extended evaluations under 10-way settings show similarly strong performance. Ablation studies further validate the critical roles of contrastive learning, normalization, and cross-modal embeddings in enhancing generalization. This work presents a scalable and interpretable paradigm for fine-grained classification in data-scarce domains.
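
The episode-averaged reporting described in the abstract (mean over 600 episodes with a 95% confidence interval) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the normal approximation (z = 1.96), and the per-episode accuracies are synthetic, generated with a hypothetical 75 query images per episode and an arbitrary 0.7 success rate. None of the numbers it produces are results from the paper.

```python
import math
import random
import statistics


def mean_with_ci95(episode_accuracies):
    """Return (mean, half_width) of a 95% confidence interval.

    Uses the normal approximation, which is reasonable for hundreds of
    episodes: half_width = 1.96 * sample_std / sqrt(n).
    """
    n = len(episode_accuracies)
    if n < 2:
        raise ValueError("need at least two episodes to estimate a spread")
    mean = statistics.fmean(episode_accuracies)
    std = statistics.stdev(episode_accuracies)  # sample standard deviation
    return mean, 1.96 * std / math.sqrt(n)


# Synthetic data only: simulate 600 episodes, each with 75 query images
# (hypothetical: 5 classes x 15 queries) and a made-up 0.7 hit rate.
rng = random.Random(0)
NUM_EPISODES = 600
QUERIES_PER_EPISODE = 75
episode_accuracies = [
    sum(rng.random() < 0.7 for _ in range(QUERIES_PER_EPISODE)) / QUERIES_PER_EPISODE
    for _ in range(NUM_EPISODES)
]

mean_acc, half_width = mean_with_ci95(episode_accuracies)
print(f"accuracy: {100 * mean_acc:.2f}% +/- {100 * half_width:.2f}%")
```

Averaging over many randomly sampled episodes matters in few-shot evaluation because a single 5-way task is small and noisy; the interval half-width shrinks with the square root of the number of episodes.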
