What question did this study set out to answer?

This research aims to automate image tagging using a lightweight adapter network and a random perturbation mechanism.

February 8, 2026

AutoIT: Automated Image Tagging with Random Perturbation

Key Points

Developed a lightweight adapter network
Implemented a random perturbation mechanism
Created a fully automated pipeline for image tagging using large language models
Conducted experiments on public benchmarks
Improved cross-modal alignment within the adapter’s embedding space
Revealed the significance of pre-trained embeddings’ magnitude for classifier performance
Challenged previous normalization approaches in embedding space
Emphasized the importance of generated text quantity and quality for tagging efficiency

Abstract

Vision-language pre-training (VLP) models have been explored as a means to bridge the text and image modalities, allowing visual classifiers for image tagging to be learned from text alone. However, existing methods rely heavily on prompt tuning, which becomes computationally prohibitive when managing a vast array of candidate labels. In this study, we present a lightweight adapter network paired with an effective random perturbation mechanism, facilitating the creation of label classifiers with augmented cross-modal transfer capabilities. Together with large language models for multi-label text generation, a fully automated pipeline for image tagging is developed without relying on any manually curated data. Through comprehensive experiments on public benchmarks, we empirically reveal the nature of random perturbation in improving cross-modal alignment within the adapter’s embedding space. Our findings also emphasize the critical role of pre-trained embeddings’ magnitude in enhancing cross-modal classifier performance, challenging the prevailing focus on normalization of the embedding space. Alongside empirical results concerning the impact of both the quantity and quality of generated texts and the efficiency of the adapter, our pivotal insights into the automated image tagging paradigm are expected to advance future research efforts within the community.
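
The abstract does not specify the adapter architecture or the form of the perturbation, so the following is only a rough sketch of the general idea and not the authors' implementation. It assumes a single linear adapter layer with sigmoid outputs, isotropic Gaussian noise as the perturbation, and synthetic random arrays standing in for frozen VLP text and image embeddings. The names `perturb`, `LinearAdapter`, and `train_step` are hypothetical. The embeddings are deliberately left un-normalized, in line with the abstract's remark that embedding magnitude matters.

```python
import numpy as np


def perturb(embeddings, sigma, rng):
    """Add isotropic Gaussian noise to each embedding (assumed noise model)."""
    noise = rng.normal(0.0, sigma, size=embeddings.shape)
    return embeddings + noise


class LinearAdapter:
    """Hypothetical adapter: one linear layer, one sigmoid score per label."""

    def __init__(self, dim, num_labels, rng):
        self.W = rng.normal(0.0, 0.01, size=(dim, num_labels))
        self.b = np.zeros(num_labels)

    def forward(self, x):
        # Multi-label scores in (0, 1); inputs are not L2-normalized.
        return 1.0 / (1.0 + np.exp(-(x @ self.W + self.b)))

    def train_step(self, x, y, lr):
        """One gradient step on mean binary cross-entropy.

        Returns the loss computed before the parameter update.
        """
        p = self.forward(x)
        eps = 1e-12
        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        grad_logits = (p - y) / y.size  # gradient of the mean BCE w.r.t. logits
        self.W -= lr * (x.T @ grad_logits)
        self.b -= lr * grad_logits.sum(axis=0)
        return loss


rng = np.random.default_rng(0)

# Synthetic stand-ins: 32 "text" embeddings of dimension 16, 5 candidate labels.
text_emb = rng.normal(size=(32, 16))
labels = (rng.random((32, 5)) < 0.3).astype(float)

adapter = LinearAdapter(dim=16, num_labels=5, rng=rng)

# Train on perturbed text embeddings only; no images are used for training.
losses = [
    adapter.train_step(perturb(text_emb, sigma=0.1, rng=rng), labels, lr=0.5)
    for _ in range(200)
]

# At inference, the same adapter scores (stand-in) image embeddings directly.
image_emb = rng.normal(size=(4, 16))
scores = adapter.forward(image_emb)
```

In this sketch the noise is resampled at every step, so the adapter never sees exactly the same text embedding twice. Whether the paper's mechanism works this way, and how the noise scale is chosen, is not stated in the abstract.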


Cite This Study

Zhu et al. (Fri,) studied this question.

synapsesocial.com/papers/6987eb5df6bacdd2fe8fc91b https://doi.org/10.1007/s11263-026-02737-y
