With the prevalence of pre-trained vision-language models such as CLIP, leveraging the generic knowledge embedded in CLIP for domain adaptation has proved to be a promising direction. However, most existing CLIP-based methods are limited to closed-set settings. This is primarily because CLIP needs the semantic labels of unknown classes for inference, which makes it inapplicable to Open-Set Domain Adaptation (OSDA). To exploit the complementary roles of CLIP and the source model, we propose a novel Semantic-guided Target Adaptation (SemTA) framework that addresses OSDA in a training-free manner. Specifically, we introduce an unknown semantic discovery module, which uses the cluster centroids of the target data to retrieve the semantic labels of unknown classes from a worldwide corpus. Semantic-based inference can then be performed with CLIP. Additionally, a dual sample attention mechanism produces sample-based inference, in which representative features from both the source model and CLIP serve as keys to improve task specificity. Compared with previous OSDA methods, which reject unknown data with a confidence threshold, the proposed approach is more practical and offers better interpretability. Comprehensive evaluations on four benchmarks show that our method sets a new state of the art even without training. Our code will be made publicly available soon.
Yu et al. studied this question.
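The unknown semantic discovery step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: all function names (`cluster_centroids`, `discover_unknown_labels`, `semantic_inference`, `run_demo`) are hypothetical, the farthest-point clustering initialization is a choice made here for determinism, and the embeddings are synthetic stand-ins for what CLIP image and text encoders would produce. The dual sample attention mechanism is not covered.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def farthest_point_init(features, k):
    # Deterministic init: start at sample 0, then repeatedly add the sample
    # least similar to every centroid chosen so far.
    chosen = [0]
    for _ in range(k - 1):
        sims = features @ features[chosen].T
        chosen.append(int(np.argmin(sims.max(axis=1))))
    return features[chosen].copy()

def cluster_centroids(features, k, iters=10):
    # Spherical k-means on unit-normalized target features.
    centroids = farthest_point_init(features, k)
    for _ in range(iters):
        assign = np.argmax(features @ centroids.T, axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:
                centroids[j] = l2_normalize(members.mean(axis=0))
    return centroids

def discover_unknown_labels(centroids, vocab_names, vocab_emb, known_names):
    # Map each centroid to its nearest vocabulary entry; entries that are
    # not already known classes become candidate unknown-class labels.
    nearest = np.argmax(centroids @ vocab_emb.T, axis=1)
    discovered = []
    for idx in nearest:
        name = vocab_names[idx]
        if name not in known_names and name not in discovered:
            discovered.append(name)
    return discovered

def semantic_inference(features, label_emb):
    # Zero-shot style prediction: pick the most similar label embedding.
    return np.argmax(features @ label_emb.T, axis=1)

def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    # ASSUMPTION: toy vocabulary with orthogonal unit "text embeddings";
    # a real pipeline would embed a large corpus with a text encoder.
    vocab_names = ["dog", "cat", "car", "tree", "boat"]
    vocab_emb = np.eye(5, 8)
    known_names = ["dog", "cat"]
    # Synthetic target features: two known concepts plus one unknown ("tree").
    true_concepts = np.repeat([0, 1, 3], 30)
    noise = 0.05 * rng.standard_normal((90, 8))
    features = l2_normalize(vocab_emb[true_concepts] + noise)

    centroids = cluster_centroids(features, k=3)
    discovered = discover_unknown_labels(
        centroids, vocab_names, vocab_emb, known_names
    )
    label_names = known_names + discovered
    label_emb = vocab_emb[[vocab_names.index(n) for n in label_names]]
    preds = semantic_inference(features, label_emb)
    return discovered, label_names, preds

discovered, label_names, preds = run_demo()
print("discovered unknown labels:", discovered)
print("label set used for inference:", label_names)
```

Because the unknown classes receive names rather than a single "reject" bucket, predictions over the extended label set stay interpretable, which is the property the abstract contrasts with confidence-threshold rejection.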