Abstract

Generating authoritative and contextually rich captions for Chinese cultural relics remains a significant challenge for standard vision-language models because of the specialized terminology and deep historical context required. We propose a multi-stage retrieval-augmented generation framework designed to bridge the gap between visual identification and expert-level documentation. Our pipeline first uses an encoder based on contrastive language-image pre-training to map artifact images into a high-level semantic space, providing initial linguistic anchors. These anchors serve as queries for a tiered knowledge retrieval system that extracts fine-grained, domain-specific information from a curated repository of Chinese cultural heritage. To ensure factual integrity, the framework synthesizes these multi-source inputs into a unified text knowledge vector, which is integrated with visual features through a late-fusion multi-layer perceptron adapter. This aligned multimodal representation is then processed by a large language model fine-tuned via low-rank adaptation to produce comprehensive, culturally grounded captions. Experimental results show that our framework significantly outperforms state-of-the-art baselines on both automatic metrics and human expert evaluations, mitigating hallucinations and providing a scalable solution for digital museology.
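The retrieval and late-fusion stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy repository, the three-dimensional embeddings, the use of the image embedding as the retrieval query, mean pooling as the way to form the knowledge vector, the random adapter weights, and all function names (`retrieve_top_k`, `knowledge_vector`, `mlp_adapter`) are assumptions chosen only to show the data flow.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, repository, k=2):
    """Return the k repository entries most similar to the query vector."""
    ranked = sorted(repository, key=lambda e: cosine(query, e["vec"]), reverse=True)
    return ranked[:k]


def knowledge_vector(entries):
    """Mean-pool retrieved embeddings into one text knowledge vector (assumed pooling)."""
    dim = len(entries[0]["vec"])
    return [sum(e["vec"][i] for e in entries) / len(entries) for i in range(dim)]


def linear(x, weights, bias):
    """Apply a dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp_adapter(visual, knowledge, w1, b1, w2, b2):
    """Late fusion: concatenate both vectors, then apply a two-layer MLP with ReLU."""
    fused = visual + knowledge  # list concatenation
    hidden = [max(0.0, h) for h in linear(fused, w1, b1)]
    return linear(hidden, w2, b2)


def random_matrix(rows, cols, rng):
    """Placeholder weights; a real adapter would learn these during training."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy repository of heritage entries with made-up embeddings.
repository = [
    {"text": "bronze ritual vessel (ding)", "vec": [0.9, 0.1, 0.0]},
    {"text": "blue-and-white porcelain vase", "vec": [0.1, 0.9, 0.1]},
    {"text": "bronze wine vessel (zun)", "vec": [0.8, 0.2, 0.1]},
]

visual = [1.0, 0.0, 0.0]  # stand-in for an image embedding
rng = random.Random(0)
w1, b1 = random_matrix(4, 6, rng), [0.0] * 4  # 6 = visual dim + knowledge dim
w2, b2 = random_matrix(5, 4, rng), [0.0] * 5  # 5 = assumed LLM input width

hits = retrieve_top_k(visual, repository, k=2)
kv = knowledge_vector(hits)
fused = mlp_adapter(visual, kv, w1, b1, w2, b2)
print([h["text"] for h in hits], len(fused))
```

In a full system the fused vector would be passed to the language model as a conditioning input; that step, and the low-rank adaptation of the model, are outside the scope of this sketch.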
Chenggang Mi
Shaoliang Xie
Digital Scholarship in the Humanities
Xi'an International Studies University
Mi et al. studied this question.
www.synapsesocial.com/papers/6a002087c8f74e3340f9b65f — DOI: https://doi.org/10.1093/llc/fqag059