Vision-language pre-training (VLP) models have been explored as a means of bridging the text and image modalities, allowing visual classifiers for image tagging to be learned from text alone. However, existing methods rely heavily on prompt tuning, which becomes computationally prohibitive when the set of candidate labels is large. In this study, we present a lightweight adapter network paired with an effective random perturbation mechanism, which together yield label classifiers with improved cross-modal transfer. Combined with large language models for multi-label text generation, this gives a fully automated image tagging pipeline that does not rely on any manually curated data. Through comprehensive experiments on public benchmarks, we show empirically how random perturbation improves cross-modal alignment within the adapter’s embedding space. Our findings also highlight the critical role of the magnitude of pre-trained embeddings in cross-modal classifier performance, challenging the prevailing focus on normalizing the embedding space. We further report empirical results on the impact of the quantity and quality of generated texts and on the efficiency of the adapter. We expect these insights into the automated image tagging paradigm to inform future research in the community.
Zhu et al. studied this question.
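To make the idea of a text-trained adapter with random perturbation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single linear layer with sigmoid outputs as the adapter, isotropic Gaussian noise as the perturbation, and synthetic vectors in place of real VLP embeddings; the function names (`make_toy_data`, `perturb`, `train_adapter`, `tag`) and all hyperparameters are hypothetical. Embeddings are deliberately left un-normalized, in line with the observation about embedding magnitude.

```python
import numpy as np


def make_toy_data(n=200, d=16, k=4, gap=0.3, seed=0):
    """Synthetic stand-in for VLP embeddings (assumption, not real data).

    Each sample is the sum of the prototypes of its active labels; image
    embeddings get a constant offset to mimic a modality gap.
    """
    rng = np.random.default_rng(seed)
    protos = rng.standard_normal((k, d)) / np.sqrt(d)
    offset = gap * rng.standard_normal(d) / np.sqrt(d)
    y_text = (rng.random((n, k)) < 0.5).astype(float)
    y_img = (rng.random((n, k)) < 0.5).astype(float)
    text_emb = y_text @ protos
    img_emb = y_img @ protos + offset
    return text_emb, y_text, img_emb, y_img


def perturb(x, sigma, rng):
    """Add isotropic Gaussian noise (assumed form of the random perturbation)."""
    return x + sigma * rng.standard_normal(x.shape)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(probs, targets, eps=1e-7):
    """Binary cross-entropy averaged over samples and labels."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))


def train_adapter(text_emb, labels, sigma=0.1, lr=0.5, steps=200, seed=0):
    """Fit a linear multi-label adapter on perturbed text embeddings only."""
    rng = np.random.default_rng(seed)
    n, d = text_emb.shape
    k = labels.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(steps):
        x = perturb(text_emb, sigma, rng)  # fresh noise at every step
        p = sigmoid(x @ W + b)
        # Gradient of BCE summed over labels and averaged over samples.
        grad = (p - labels) / n
        W -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def tag(emb, W, b):
    """Score every label for each embedding (text or image)."""
    return sigmoid(emb @ W + b)
```

Under these assumptions, a typical use would be to call `make_toy_data()`, train with `train_adapter(text_emb, y_text)`, and then score the image-side vectors with `tag(img_emb, W, b)`. Varying `sigma` and `gap` gives a rough feel for how perturbation during text-only training interacts with a modality offset, though a toy setup of this kind says nothing about the results reported on real benchmarks.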