What type of study is this?

This is an experimental study.

October 16, 2025 | Open Access

M²IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering

Key Points

M²IV improves performance across diverse tasks and LVLMs, achieving an average accuracy gain of 3.74%.
Replacing token-level demonstrations with learnable multimodal in-context vectors reduces token overhead and scales to many-shot scenarios.
The training strategy builds on the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in ICL to achieve fine-grained semantic distillation and robust cross-modal representation learning.
VLibrary provides a flexible repository for customized retrieval and injection of trained multimodal in-context vectors.

Abstract

Multimodal in-context learning (ICL) equips Large Vision-Language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose M²IV, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M²IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M²IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce VLibrary, a repository that stores trained M²IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M²IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74% with substantial improvements in overall efficiency.
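The core idea of injecting vectors into a residual stream can be illustrated with a toy sketch. The abstract does not give the actual injection formula, vector shapes, or training procedure, so everything below is an assumption made for illustration: `toy_layer` stands in for a transformer block, the per-layer scaling weights `alphas` are hypothetical, and the `library` dictionary is only a loose analogy for a store of trained vectors, not the VLibrary interface.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def scale(a: Vector, s: float) -> Vector:
    """Multiply every component of a vector by a scalar."""
    return [x * s for x in a]


def toy_layer(h: Vector) -> Vector:
    """Hypothetical stand-in for one block's combined MHA + MLP update."""
    return [0.5 * x for x in h]


def forward(
    h: Vector,
    n_layers: int,
    vectors: Optional[List[Vector]] = None,
    alphas: Optional[List[float]] = None,
) -> Vector:
    """Run the toy residual stream, optionally adding one vector per layer."""
    for layer in range(n_layers):
        h = add(h, toy_layer(h))  # ordinary residual update
        if vectors is not None and alphas is not None:
            # Assumed injection rule: add a scaled per-layer vector
            # to the residual stream instead of prepending demonstrations.
            h = add(h, scale(vectors[layer], alphas[layer]))
    return h


# Hypothetical store of trained vectors, keyed by (model name, task name).
library: Dict[Tuple[str, str], Tuple[List[Vector], List[float]]] = {
    ("toy-lvlm", "vqa"): ([[1.0, 0.0], [0.0, 1.0]], [0.1, 0.2]),
}

vectors, alphas = library[("toy-lvlm", "vqa")]
plain = forward([1.0, 1.0], n_layers=2)
steered = forward([1.0, 1.0], n_layers=2, vectors=vectors, alphas=alphas)
print(plain, steered)
```

In this sketch the input prompt is unchanged between the two calls; only the added vectors shift the hidden state, which is why such an approach would not consume context tokens for demonstrations.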


Cite This Study

Li et al. (Sun,) studied this question.

synapsesocial.com/papers/68f0f51d8dd8ea469b1d6f7f https://doi.org/10.48550/arxiv.2504.04633
