What question did this study set out to answer?

The objective is to enhance multimodal in-context learning by integrating task-specific retrieval and reasoning mechanisms.

March 17, 2026 | Open Access

GPT-MM: Improving Multimodal In-context Learning with Task-specific Retrieval and Reasoning

Key Points

The study proposes a unified ICL framework that combines task-aware demonstration retrieval with label-induced reasoning.
The framework is first validated on textual relation extraction (RE).
It is then extended to visual question answering (VQA) and audio question answering (AudioQA).
It substantially narrows the performance gap between ICL and fully supervised models.
It consistently outperforms GPT-3 and GPT-4 baselines across textual and multimodal benchmarks.
It achieves competitive or superior results compared with fine-tuned models.

Abstract

Large language models (LLMs) have exhibited impressive generalization through in-context learning (ICL), yet most studies focus on textual tasks, leaving the mechanisms that enable ICL to generalize across modalities largely unexplored. To bridge this gap, we propose a unified ICL framework that integrates task-aware demonstration retrieval and label-induced reasoning as two complementary components for improving both accuracy and interpretability. We first validate the framework in textual relation extraction (RE), a representative structured prediction task that challenges LLMs to infer fine-grained entity–relation semantics. Task-aware retrieval ensures that retrieved examples are semantically aligned with the target instance, while label-induced reasoning enriches each demonstration with label-grounded explanatory logic. These mechanisms substantially narrow the performance gap between ICL and fully supervised models. We then extend this framework to multimodal ICL, leveraging GPT-4o for visual question answering (VQA) and Whisper-large-v3 for audio question answering (AudioQA). Across both textual and multimodal benchmarks, our framework consistently outperforms GPT-3 and GPT-4 baselines and achieves competitive or superior results compared with fine-tuned models. These findings demonstrate that task-aware retrieval and label-induced reasoning together form a generalizable foundation for a unified in-context learning paradigm across modalities.
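The two components described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how retrieval or reasoning generation is implemented, so everything here is an assumption: a bag-of-words cosine similarity stands in for a task-aware retriever, and the demonstration pool, relation labels, reasoning strings, and function names (`retrieve`, `build_prompt`) are hypothetical placeholders rather than the authors' method.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector; a toy stand-in for a learned, task-aware encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, pool, k=2):
    """Return the k pool items most similar to the query (ties keep pool order)."""
    q = bow(query)
    return sorted(pool, key=lambda d: cosine(q, bow(d["text"])), reverse=True)[:k]


def build_prompt(query, demos):
    """Format each demonstration with a label-grounded explanation, then append the query."""
    blocks = [
        f"Sentence: {d['text']}\nReasoning: {d['reasoning']}\nRelation: {d['label']}"
        for d in demos
    ]
    blocks.append(f"Sentence: {query}\nReasoning:")
    return "\n\n".join(blocks)


# Hypothetical demonstration pool; the reasoning strings are hand-written placeholders.
POOL = [
    {"text": "Marie Curie was born in Warsaw.", "label": "place_of_birth",
     "reasoning": "The phrase 'was born in' links the person to a location of birth."},
    {"text": "Tim Cook is the chief executive of Apple.", "label": "employer",
     "reasoning": "The phrase 'chief executive of' ties the person to an organization."},
    {"text": "The Nile flows through Egypt.", "label": "located_in",
     "reasoning": "The phrase 'flows through' places the river inside the country."},
]

query = "Alan Turing was born in London."
demos = retrieve(query, POOL, k=2)
prompt = build_prompt(query, demos)
print(prompt)
```

In this sketch the prompt ends with an open "Reasoning:" slot, so a model would be asked to produce its explanation before the relation label, mirroring the format of the demonstrations. A real system would presumably replace `bow` with an encoder tuned to the task and generate the reasoning strings from the gold labels rather than writing them by hand.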

Cite This Study

Wan et al. studied this question.

synapsesocial.com/papers/69b8ef6ddeb47d591b8c5825 https://doi.org/10.5715/jnlp.33.207
