Cross-modal semantic matching for the intelligent annotation and retrieval of cultural resources faces two problems that traditional models handle poorly: semantic gaps caused by the context dependence of cultural concepts, and mismatched feature granularity. To address them, this study proposes a cross-modal semantic matching method based on cultural feature decoupling. It constructs a three-layer visual semantic analysis system of "parts-objects-cultural themes", explicitly models cultural symbols in cultural relic images, designs a culture-enhanced contrastive learning framework, and introduces authoritative classification standards to constrain the cultural consistency of the embedding space. At the core of the method, the cultural feature decoupling framework adopts a three-level feature extraction structure: modules including a multi-scale feature extractor and a differentiable semantic clustering layer extract cultural semantic features of different granularity layer by layer, and a Gated Feature Interaction Unit (GFIU) fuses them. The cross-modal semantic alignment mechanism improves image-text matching accuracy and adaptability to cultural context through hierarchical contrastive learning and a cultural semantic constraint mechanism. Experiments are carried out on three multimodal cultural resource datasets (GG-Bronze, Dunhuang-M, CCH-20). The results show that the proposed method significantly outperforms baseline models such as CLIP and ALIGN in cross-modal retrieval. On the GG-Bronze dataset, R@1 reaches 82.4% and mAP reaches 84.9%. In the component-level cultural symbol matching task, the accuracy of bronze decoration localization improves by over 68% compared with CLIP. Culture-specific ablation experiments verify the method's cross-domain generalization ability: retrieval on Mayan cultural relics, which were not used in training, reaches an R@5 of 82.1%.
Measured system performance shows that retrieval latency stays below 50 ms at an index scale of 100 million, which meets the demands of high-concurrency, real-time use. The method thus provides effective technical support for cross-modal cultural resource management in digital cultural relics, cultural education, and the creative industries.
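For readers unfamiliar with the retrieval setup, the following minimal sketch illustrates the two generic ingredients the abstract relies on: a contrastive image-text objective and the R@K metric. It is not the hierarchical, culture-constrained loss of the proposed method, whose details are not given here; it is the plain symmetric InfoNCE form used by CLIP-style models. The function names, the temperature of 0.07, and the convention that row i of the image matrix matches row i of the text matrix are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def symmetric_info_nce(img, txt, temperature=0.07):
    """Symmetric contrastive loss for a batch of matched pairs.

    Assumption: row i of `img` and row i of `txt` describe the same item.
    The temperature value is illustrative, not taken from the paper.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_on_diagonal(l):
        # Numerically stable log-softmax over each row; the correct
        # class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def recall_at_k(img, txt, k):
    """Fraction of text queries whose matching image ranks in the top k."""
    sims = l2_normalize(txt) @ l2_normalize(img).T
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best images
    hits = (topk == np.arange(len(txt))[:, None]).any(axis=1)
    return float(hits.mean())
```

Under these assumptions, perfectly aligned embeddings give a loss near zero and an R@1 of 1.0, while misaligned pairs raise the loss and lower recall. An R@1 of 82.4% therefore means that for 82.4% of queries the correct counterpart is the single top-ranked result.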
Lin et al. (Sun,) studied this question.