What question did this study set out to answer?

This research aims to enhance few-shot generalization in vision-language models by introducing MMA++, a parameter-efficient adaptation framework.

May 28, 2026

MMA++: Effective Multi-Modal Adaptation for Vision-Language Models

Key Points

Introduced the MMA++ framework, which uses a dataset-level analysis to apply adapters selectively to the higher layers of the vision and text encoders.
Proposed an alpha-consistency framework consisting of consistency training and alpha-decoupling strategies.
Evaluated performance on few-shot generalization tasks, including base-to-novel generalization, cross-dataset transfer, and domain generalization.
Achieved leading performance across few-shot generalization tasks, outperforming prior approaches.
Showed, empirically and theoretically, that the fusion scale alpha should be adapted to the training data size rather than held fixed.
Empirical evaluations confirmed the efficacy of MMA++ in bridging modality gaps for effective VLM adaptation.

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have shown good generalization capabilities across diverse downstream tasks. However, adapting such large-scale models to few-shot generalization scenarios remains challenging due to the trade-off between preserving general knowledge and incorporating task-specific information. In this paper, we propose MMA++, an advanced and effective Multi-Modal Adapter framework for parameter-efficient VLM adaptation. Unlike prior works that inject adapters into each modality independently or uniformly across layers, MMA++ performs a dataset-level analysis to identify discriminative and generalizable features, and selectively applies adapters to the higher layers of both the vision and text encoders. To bridge the modality gap, we further propose a shared feature projection space that enhances alignment between modalities. Beyond architecture design, we identify the fusion scale α, which controls the strength of adapter integration, as a key factor in few-shot generalization. We empirically and theoretically demonstrate that α should not be static, but should instead be adapted to the training data size. To reduce the effort of tuning this value across different datasets, we propose the α-consistency framework, consisting of: (1) a consistency training strategy under varying fusion scales; and (2) an α-decoupling strategy that uses a larger fusion scale during training and a smaller one at inference to account for sample size mismatch. We evaluate MMA++ on a wide range of few-shot generalization tasks, including base-to-novel generalization, cross-dataset transfer, and domain generalization. Our method consistently achieves leading performance.
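
The abstract does not give the exact formulation, so the following is only a minimal sketch of how a fusion scale could enter an adapter and how the two parts of the consistency framework might look in code. It assumes a residual bottleneck adapter of the form h + alpha * adapter(h), a KL divergence as the consistency term, and a linear classifier on the fused feature. All names (`adapter`, `fuse`, `predict`, `consistency_loss`) and the values of `ALPHA_TRAIN` and `ALPHA_INFER` are hypothetical and are not taken from the paper.

```python
import math
import random

# Hypothetical values for illustration only; not taken from the paper.
ALPHA_TRAIN = 1.0  # larger fusion scale used during training
ALPHA_INFER = 0.5  # smaller fusion scale used at inference (decoupling)


def matvec(rows, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def adapter(h, w_down, w_up):
    """Assumed bottleneck adapter: down-project, ReLU, up-project."""
    z = [max(0.0, v) for v in matvec(w_down, h)]
    return matvec(w_up, z)


def fuse(h, delta, alpha):
    """Residual fusion: frozen feature plus the alpha-scaled adapter output."""
    return [x + alpha * d for x, d in zip(h, delta)]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(h, w_down, w_up, w_cls, alpha):
    """Class probabilities for feature h at a given fusion scale."""
    delta = adapter(h, w_down, w_up)
    return softmax(matvec(w_cls, fuse(h, delta, alpha)))


def consistency_loss(h, w_down, w_up, w_cls, alpha_a, alpha_b):
    """KL divergence between predictions made under two fusion scales."""
    p = predict(h, w_down, w_up, w_cls, alpha_a)
    q = predict(h, w_down, w_up, w_cls, alpha_b)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def rand_matrix(n_rows, n_cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
dim, bottleneck, n_classes = 8, 2, 3
h = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # stand-in for a frozen encoder feature
w_down = rand_matrix(bottleneck, dim, rng)
w_up = rand_matrix(dim, bottleneck, rng)
w_cls = rand_matrix(n_classes, dim, rng)

# Consistency term: predictions should agree across fusion scales.
loss = consistency_loss(h, w_down, w_up, w_cls, ALPHA_TRAIN, ALPHA_INFER)

# Decoupling: inference uses the smaller scale.
probs = predict(h, w_down, w_up, w_cls, ALPHA_INFER)
```

In this sketch the consistency term is zero when both scales are equal and grows as the predictions diverge, which is one way to encourage a model to be less sensitive to the exact value of the fusion scale. Setting `alpha` to 0 recovers the frozen feature unchanged.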
