Recent image captioning methods based on pre-trained vision–language models can generate fluent and coherent descriptions, yet they still struggle in open-domain scenes that contain long-tail concepts, uncommon object combinations, and ambiguous visual evidence. Two limitations are especially important. First, the knowledge needed to recognize and name rare or domain-specific entities is only weakly represented in model parameters, causing captions to be generic, incomplete, or biased toward frequent concepts. Second, token generation is typically grounded mainly by local visual matching, making it sensitive to clutter, occlusion, and visually similar distractors, and therefore prone to attribute errors, relation confusion, and object hallucination. To address these issues, we propose R2G (retrieval- and grounding-guided captioning), a lightweight plug-in framework for frozen image captioning backbones. R2G consists of two complementary components. The first, retrieval-guided visual prompting, retrieves image-relevant concepts from an external visual concept memory, converts them into a continuous prompt representation, and injects this representation into selected layers of the visual encoder, so that external semantic information can influence visual feature formation before decoding begins. The second, global–local semantic grounding, derives a global semantic prior from an auxiliary vision–language encoder and adaptively fuses it with token-level local visual evidence through a decoder-state-dependent gating mechanism, thereby improving semantic stability while preserving fine-grained visual support. The resulting framework is lightweight, compatible with frozen pre-trained backbones, and designed to improve both concept coverage and semantic faithfulness. Experimental results on MS-COCO and NoCaps show that R2G consistently improves caption quality over the baseline and yields particularly clear gains in open-domain and out-of-domain settings.
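The two components described above can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names `retrieve_prompt` and `gated_fusion`, the use of cosine similarity with softmax pooling for retrieval, and the single-layer sigmoid gate. It is not the authors' implementation, and it omits the injection of the prompt into encoder layers.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def retrieve_prompt(image_feat, memory, k=3):
    """Hypothetical retrieval step: pick the k memory entries most similar
    to the image feature (cosine similarity) and pool them into a single
    continuous prompt vector using softmax weights over the similarities.

    image_feat: (d,) image embedding
    memory:     (n, d) concept embeddings (assumed non-zero rows)
    returns:    (prompt of shape (d,), indices of the k retrieved entries)
    """
    q = image_feat / np.linalg.norm(image_feat)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = m @ q                          # (n,) cosine similarities
    top = np.argsort(-sims)[:k]           # indices of the k best matches
    w = np.exp(sims[top] - sims[top].max())
    w /= w.sum()                          # softmax weights over the top-k
    return w @ memory[top], top


def gated_fusion(h, g, l, W, b):
    """Hypothetical decoder-state-dependent gate (one possible form):
        gate  = sigmoid(W h + b)
        fused = gate * g + (1 - gate) * l

    h: (d_h,) decoder state, g: (d,) global semantic prior,
    l: (d,) token-level local visual evidence,
    W: (d, d_h) and b: (d,) are the gate parameters.
    """
    gate = sigmoid(W @ h + b)
    return gate * g + (1.0 - gate) * l, gate
```

In this sketch the gate is computed per feature dimension from the decoder state alone, so each fused feature is a convex combination of the global prior and the local evidence. A learned gate in a real system could also condition on `g` and `l`.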
Lin et al. (Wed,) studied this question.