Fine-grained image classification under few-shot conditions remains a significant challenge because labeled data are scarce and classes are visually similar to one another (high inter-class similarity). This paper proposes a cross-modal framework that integrates Contrastive Language-Image Pretraining (CLIP) embeddings into a Siamese similarity network to enable robust, label-efficient classification. By leveraging the semantic alignment between textual class descriptions and visual representations, the model forms hybrid similarity pairs, both image-to-image and image-to-text, within a shared latent space, which supports discriminative learning in low-shot scenarios. The architecture employs a dual-branch CLIP encoder and a contrastive loss function that optimizes intra-class compactness and inter-class separability. Experiments on benchmark datasets, including miniImageNet and CUB-200-2011, demonstrate substantial improvements over zero-shot and few-shot baselines. On CUB-200-2011, with results averaged over 600 episodes and reported with 95% confidence intervals, the model achieves 70.32% accuracy, 71.15% F1-score, and 68.47% mAP on 5-way 1-shot tasks, and 78.41% accuracy, 79.02% F1-score, and 76.83% mAP on 5-way 5-shot tasks. Extended evaluations under 10-way settings show similarly strong performance. Ablation studies further validate the critical roles of contrastive learning, normalization, and cross-modal embeddings in enhancing generalization. This work presents a scalable and interpretable paradigm for fine-grained classification in data-scarce domains.
Olaniyan et al. (Thu,) studied this question.
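The abstract does not give the exact form of the contrastive loss, so the following is only a minimal sketch of how a pairwise contrastive objective over normalized embeddings might look. It assumes the common margin-based formulation (squared distance for same-class pairs, squared hinge on the margin for different-class pairs), and it uses plain NumPy arrays as stand-ins for CLIP image and text embeddings. The function names `l2_normalize`, `contrastive_loss`, and `classify_by_prototype`, as well as the `margin` parameter, are hypothetical and not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so distances depend only on direction."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def contrastive_loss(anchor, other, same_class, margin=1.0):
    """Margin-based pairwise contrastive loss (assumed form, not the paper's).

    anchor, other: arrays of shape (n_pairs, dim). `other` may hold image
    embeddings (image-to-image pairs) or text embeddings (image-to-text
    pairs), since both are assumed to live in the same latent space.
    same_class: array of shape (n_pairs,), 1 for matching pairs, 0 otherwise.
    """
    a = l2_normalize(anchor)
    b = l2_normalize(other)
    y = np.asarray(same_class, dtype=float)
    dist = np.linalg.norm(a - b, axis=-1)
    # Matching pairs are pulled together (intra-class compactness).
    pull = y * dist ** 2
    # Non-matching pairs are pushed apart up to the margin (inter-class separability).
    push = (1.0 - y) * np.maximum(0.0, margin - dist) ** 2
    return float(np.mean(pull + push))


def classify_by_prototype(query, prototypes):
    """Return the index of the prototype with the highest cosine similarity."""
    q = l2_normalize(query)
    p = l2_normalize(prototypes)
    return int(np.argmax(p @ q))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(4, 8))   # stand-in for image embeddings
    text_emb = rng.normal(size=(4, 8))    # stand-in for class-description embeddings
    labels = np.array([1, 0, 1, 0])       # 1 = image and text describe the same class
    print("hybrid-pair loss:", contrastive_loss(image_emb, text_emb, labels))
```

In a 5-way episode, the rows passed to `classify_by_prototype` could be per-class averages of support-image embeddings, text embeddings of the class descriptions, or a combination of the two; which of these the paper uses is not stated in the abstract.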