Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP CLS token; however, this approach overlooks spatial precision. We propose microCLIP, a self-training framework that jointly refines CLIP's visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided FG token from patch embeddings and fuses it with the global CLS token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion's evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent 2.90% average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
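As a rough illustration of two ideas named in the abstract, the sketch below shows a softmax-weighted pooling of patch embeddings into a single token and a convex combination of fixed prior logits with evolving logits to pick a pseudo-label. It is not the authors' implementation: the function names (`saliency_pool`, `aggregate_logits`, `pseudo_label`), the use of a plain softmax over given saliency scores, and the single mixing weight `alpha` are all assumptions made for illustration, and the abstract does not specify how saliency is computed or how the weight is chosen.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def saliency_pool(patches, saliency):
    """Softmax-weighted average of patch embeddings (hypothetical stand-in for an FG token).

    `patches` is a list of equal-length vectors; `saliency` holds one score per patch.
    How saliency is obtained is an assumption left outside this sketch.
    """
    weights = softmax(saliency)
    dim = len(patches[0])
    return [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]


def aggregate_logits(prior_logits, evolving_logits, alpha):
    """Convex combination: alpha * prior + (1 - alpha) * evolving, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * p + (1.0 - alpha) * e
            for p, e in zip(prior_logits, evolving_logits)]


def pseudo_label(logits):
    """Index of the largest logit, used here as the pseudo-label."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy values chosen only to exercise the functions.
patches = [[1.0, 0.0], [0.0, 1.0]]
fg_token = saliency_pool(patches, saliency=[0.0, 0.0])  # equal scores give a plain mean
combined = aggregate_logits([2.0, 0.0, 1.0], [0.0, 4.0, 1.0], alpha=0.5)
label = pseudo_label(combined)
```

In this toy setting the fixed prior alone would select class 0, while the combined logits select class 1, which is the sense in which evolving logits can revise a pseudo-label without discarding the prior.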
Sathira Silva
Eman Ali
Chetan Arora
Silva et al. (Thu,) studied this question.
www.synapsesocial.com/papers/68e7d631bd66d359be6266d8 — DOI: https://doi.org/10.48550/arxiv.2510.02270