September 10, 2025 | Open Access

A Multi-Task Learning Framework Based on CLIP and Adapter Modules

Key Points

Our approach improves performance by up to 12% while adding less than 0.2% additional parameters.
By using lightweight adapter modules, the model maintains CLIP's original zero-shot capabilities.
This framework facilitates adaptation across classification, image-text retrieval, and regression tasks.
Compared with conventional transfer strategies, it offers significant advantages in parameter efficiency and task generalization.
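
To give a sense of scale for the parameter-budget claim, the following is a minimal arithmetic sketch. The backbone size, embedding dimension, bottleneck width, and number of adapters are all illustrative assumptions, not values reported for this work, and the helper names are hypothetical.

```python
def adapter_param_count(dim: int, bottleneck: int) -> int:
    """Parameters in one bottleneck adapter: two linear layers with biases."""
    down = dim * bottleneck + bottleneck  # down-projection weights + bias
    up = bottleneck * dim + dim           # up-projection weights + bias
    return down + up


def added_fraction(backbone_params: int, dim: int, bottleneck: int, n_adapters: int) -> float:
    """Fraction of new trainable parameters relative to the frozen backbone."""
    return n_adapters * adapter_param_count(dim, bottleneck) / backbone_params


# All four values below are assumptions chosen for illustration only.
BACKBONE_PARAMS = 150_000_000  # round number for a frozen dual-encoder backbone
EMBED_DIM = 512                # assumed shared embedding width
BOTTLENECK = 64                # assumed adapter hidden width
N_ADAPTERS = 2                 # assumed: one adapter per modality

fraction = added_fraction(BACKBONE_PARAMS, EMBED_DIM, BOTTLENECK, N_ADAPTERS)
print(f"added parameters: {fraction:.4%}")
```

Under these assumed sizes the added fraction stays well below the 0.2% budget; larger bottlenecks or per-task adapters would raise it proportionally.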

Abstract

In recent years, with the rapid development of cross-modal learning, pretrained models such as CLIP have demonstrated powerful zero-shot capabilities in image-text alignment tasks, making them central to multimodal research. However, a key challenge remains: how to transfer these capabilities to downstream tasks effectively while preserving the strengths of CLIP. To address this, we propose a parameter-efficient multi-task fine-tuning framework, Multi-Task CLIP-Adapter. By inserting lightweight Adapter modules after the frozen CLIP encoder, our method enables unified adaptation across multiple tasks, including classification, image-text retrieval, and regression. Experimental results show that our approach achieves an 8% to 12% performance improvement with less than 0.2% additional parameters, while maintaining the original model's zero-shot capability. Compared to the original CLIP and conventional transfer strategies, the Multi-Task CLIP-Adapter offers significant advantages in parameter efficiency and task generalization, paving a new path for scalable applications of large multimodal models.
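
The abstract does not spell out the adapter's internals, so the following is only a rough sketch of one common design: a residual bottleneck adapter placed after a frozen encoder, feeding separate heads for classification, regression, and retrieval. The class and function names, the dimensions, the ReLU nonlinearity, and the zero-initialised up-projection are all assumptions made for illustration, and random vectors stand in for real CLIP embeddings.

```python
import math
import random


def linear(x, weight, bias):
    """Compute weight @ x + bias for one vector x (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(rng, in_dim, out_dim, scale=0.02):
    """Small random weights and zero biases for a linear layer."""
    weight = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class Adapter:
    """Hypothetical residual bottleneck adapter applied to a frozen embedding."""

    def __init__(self, rng, dim, bottleneck):
        self.down = init_linear(rng, dim, bottleneck)
        # Assumption: a zero-initialised up-projection makes the adapter an
        # identity map before training, so the frozen embedding passes through.
        self.up = ([[0.0] * bottleneck for _ in range(dim)], [0.0] * dim)

    def __call__(self, x):
        hidden = [max(0.0, h) for h in linear(x, *self.down)]  # ReLU
        delta = linear(hidden, *self.up)
        return [xi + di for xi, di in zip(x, delta)]  # residual connection


def cosine(a, b):
    """Cosine similarity, used here as the image-text retrieval score."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
DIM, BOTTLENECK, N_CLASSES = 8, 2, 3  # toy sizes, chosen for readability

# Random stand-ins for the outputs of the frozen image and text encoders.
image_emb = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
text_emb = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

adapter = Adapter(rng, DIM, BOTTLENECK)
adapted = adapter(image_emb)

# One shared adapted embedding feeds three task-specific outputs.
cls_head = init_linear(rng, DIM, N_CLASSES)
reg_head = init_linear(rng, DIM, 1)
class_logits = linear(adapted, *cls_head)
regression_value = linear(adapted, *reg_head)[0]
retrieval_score = cosine(adapted, text_emb)
```

In this sketch only the adapter and the two heads would hold trainable parameters; the encoders that produce `image_emb` and `text_emb` are treated as fixed, which is what keeps the added parameter count small.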
