What question did this study set out to answer?

This study aims to enhance image captioning accuracy for open-domain scenes with uncommon concepts and visual ambiguity.

May 15, 2026 | Open Access

Retrieval-Guided and Semantically Grounded Image Captioning for Open-Domain Scenes

Key Points

Developed R2G (retrieval- and grounding-guided captioning), a lightweight plug-in framework for frozen image captioning backbones.
Combined retrieval-guided visual prompting with global–local semantic grounding.
Evaluated on the MS-COCO and NoCaps datasets.
R2G consistently improved caption quality over the baseline on both datasets.
Gains were clearest in open-domain and out-of-domain settings.
Improvements are attributed to better concept coverage and semantic faithfulness.

Abstract

Recent image captioning methods based on pre-trained vision–language models can generate fluent and coherent descriptions, yet they still struggle in open-domain scenes that contain long-tail concepts, uncommon object combinations, and ambiguous visual evidence. Two limitations are especially important. First, the knowledge needed to recognize and name rare or domain-specific entities is only weakly represented in model parameters, causing captions to be generic, incomplete, or biased toward frequent concepts. Second, token generation is typically grounded mainly by local visual matching, making it sensitive to clutter, occlusion, and visually similar distractors, and therefore prone to attribute errors, relation confusion, and object hallucination. To address these issues, we propose R2G (retrieval- and grounding-guided captioning), a lightweight plug-in framework for frozen image captioning backbones. R2G consists of two complementary components. The first, retrieval-guided visual prompting, retrieves image-relevant concepts from an external visual concept memory, converts them into a continuous prompt representation, and injects this representation into selected layers of the visual encoder, so that external semantic information can influence visual feature formation before decoding begins. The second, global–local semantic grounding, derives a global semantic prior from an auxiliary vision–language encoder and adaptively fuses it with token-level local visual evidence through a decoder-state-dependent gating mechanism, thereby improving semantic stability while preserving fine-grained visual support. The resulting framework is lightweight, compatible with frozen pre-trained backbones, and designed to improve both concept coverage and semantic faithfulness. Experimental results on MS-COCO and NoCaps show that R2G consistently improves caption quality over the baseline and yields particularly clear gains in open-domain and out-of-domain settings.
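The two components described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no equations, so the cosine-similarity retrieval, the mean pooling of retrieved embeddings, the scalar sigmoid gate, and every name below (`retrieve_prompt`, `gated_fusion`, `memory`, `gate_weights`) are assumptions chosen only to make the idea concrete. Injecting the prompt into encoder layers is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_prompt(image_vec, memory, k=2):
    """Return the top-k concept names and the mean of their embeddings.

    `memory` maps concept name -> embedding (a hypothetical structure).
    Mean pooling is an assumed stand-in for a learned prompt encoder.
    """
    if not memory:
        raise ValueError("concept memory is empty")
    ranked = sorted(memory, key=lambda name: cosine(image_vec, memory[name]),
                    reverse=True)
    top = ranked[:k]
    prompt = [sum(memory[name][i] for name in top) / len(top)
              for i in range(len(image_vec))]
    return top, prompt


def gated_fusion(global_prior, local_evidence, decoder_state,
                 gate_weights, gate_bias=0.0):
    """Blend global and local vectors with a gate computed from the decoder state.

    g = sigmoid(w . h + b); fused = g * global + (1 - g) * local.
    A single scalar gate is an assumption; a per-dimension gate is
    equally plausible given the abstract.
    """
    logit = sum(w * h for w, h in zip(gate_weights, decoder_state)) + gate_bias
    g = 1.0 / (1.0 + math.exp(-logit))
    fused = [g * p + (1.0 - g) * e
             for p, e in zip(global_prior, local_evidence)]
    return g, fused


# Toy usage with made-up 3-dimensional embeddings.
memory = {
    "okapi": [1.0, 0.0, 0.0],
    "zebra": [0.8, 0.6, 0.0],
    "sofa": [0.0, 0.0, 1.0],
}
top, prompt = retrieve_prompt([1.0, 0.1, 0.0], memory, k=2)
g, fused = gated_fusion(prompt, [0.2, 0.9, 0.1],
                        decoder_state=[0.5, -0.5],
                        gate_weights=[1.0, 1.0])
print(top, prompt, g, fused)
```

In this sketch the gate depends only on the decoder state, so the balance between the global prior and the local evidence can change from token to token, which is the behavior the abstract attributes to the decoder-state-dependent gating mechanism.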

Cite This Study

Lin et al. studied this question.

synapsesocial.com/papers/6a06b95be7dec685947ac00b
https://doi.org/10.3390/math14101667
