What question did this study set out to answer?

The aim is to generate contextually rich and accurate captions for Chinese cultural relics using specialized knowledge retrieval.

May 10, 2026

Beyond the frame: generating culturally rich captions for Chinese relics with retrieval-augmented knowledge

Key Points

The aim is to generate contextually rich and accurate captions for Chinese cultural relics using specialized knowledge retrieval.
Developed a multi-stage retrieval-augmented generation framework.
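Used a contrastive language-image pre-training-based encoder to map artifact images into a semantic space that supplies initial linguistic anchors for knowledge retrieval.
Fused the retrieved text knowledge with visual features through a late-fusion multi-layer perceptron adapter, then generated captions with a large language model optimized via low-rank adaptation.
Significantly outperformed state-of-the-art baselines in both automatic metrics and human expert evaluations.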
Effectively mitigated hallucinations in caption generation.

Abstract

Generating authoritative and contextually rich captions for Chinese cultural relics remains a significant challenge for standard vision-language models due to the specialized terminology and deep historical context required. We propose a novel, multi-stage retrieval-augmented generation framework designed to bridge the gap between visual identification and expert-level documentation. Our pipeline first utilizes a contrastive language-image pre-training-based encoder to map artifact images into a high-level semantic space, providing initial linguistic anchors. These anchors serve as queries for a tiered knowledge retrieval system that extracts fine-grained, domain-specific information from a curated repository of Chinese cultural heritage. To ensure factual integrity, the framework synthesizes these multi-source inputs into a unified text knowledge vector, which is integrated with visual features through a late-fusion multi-layer perceptron adapter. This aligned multimodal representation is then processed by a large language model optimized via low-rank adaptation to produce comprehensive, culturally grounded captions. Experimental results demonstrate that our framework significantly outperforms state-of-the-art baselines in both automatic metrics and human expert evaluations, effectively mitigating hallucinations and providing a scalable solution for digital museology.
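The data flow described in the abstract (image embedding, knowledge retrieval, a unified knowledge vector, late fusion through an MLP adapter) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The 4-dimensional vectors are made-up stand-ins for encoder embeddings, the knowledge entries are placeholder text, mean pooling is an assumed way to form the "unified text knowledge vector", concatenation is an assumed fusion input, and the adapter weights are random and untrained. The single-stage retrieval here simplifies the tiered system, and the large language model and low-rank adaptation steps are omitted.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, knowledge_base, k=2):
    """Return the k (text, vector) entries most similar to the query vector."""
    ranked = sorted(knowledge_base, key=lambda e: cosine(query, e[1]), reverse=True)
    return ranked[:k]

def mean_pool(vectors):
    """Average several vectors into one (assumed pooling choice)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

class MLPAdapter:
    """Two-layer perceptron with ReLU; weights are random, i.e. untrained."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
                   for _ in range(out_dim)]
        self.b2 = [0.0] * out_dim

    def __call__(self, x):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return [sum(w * v for w, v in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]

def late_fusion(image_vec, knowledge_vec, adapter):
    """Concatenate the two modality vectors and project them with the adapter."""
    return adapter(image_vec + knowledge_vec)

# Toy knowledge base: placeholder texts with made-up embeddings.
knowledge_base = [
    ("bronze ritual vessel note", [1.0, 0.0, 0.0, 0.1]),
    ("glazed pottery figure note", [0.0, 1.0, 0.0, 0.0]),
    ("bronze casting motif note", [0.8, 0.2, 0.1, 0.3]),
    ("porcelain ware note", [0.0, 0.0, 1.0, 0.0]),
]

image_vec = [0.9, 0.1, 0.0, 0.2]          # stand-in for an image embedding
top_entries = retrieve(image_vec, knowledge_base, k=2)
knowledge_vec = mean_pool([vec for _, vec in top_entries])

adapter = MLPAdapter(in_dim=8, hidden_dim=6, out_dim=5)
fused = late_fusion(image_vec, knowledge_vec, adapter)

print([text for text, _ in top_entries])
print(len(fused))
```

In a trained system the fused vector would be passed to the language model as conditioning input. Here it only shows that retrieval and fusion are separate steps, with the knowledge vector computed before it meets the visual features.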
