This paper presents a multimodal large-model pipeline for constructing a cultural-tourism integration knowledge graph (CTI-KG). By harmonizing text, imagery, audio and geospatial data, the pipeline automatically extracts and links cultural entities, tourism services and intangible heritage, then fuses them into a unified semantic layer. A cross-modal alignment module transfers shared representations among vision, language and geographic signals, enabling coherent knowledge fusion without manual pairing. The resulting graph supports several smart-tourism applications, including personalized route planning, immersive storytelling and dynamic resource recommendation. Extensive evaluations confirm that the proposed approach improves semantic richness and user engagement while remaining generalizable to other heritage domains. The framework offers an open, reusable methodology for constructing multimodal knowledge graphs at scale.
Xuxiang Zhang (Sun,) studied this question.
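
To make the idea of linking entities across modalities without manual pairing more concrete, the following is a minimal sketch, not the pipeline described in the abstract. It assumes that each modality has already been encoded into a shared embedding space; the vectors below are hand-picked toy values, not the output of any model. The names `link_mentions`, `to_triples` and `MENTIONS`, the relation labels, and the greedy similarity threshold are all hypothetical and chosen only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def link_mentions(mentions, threshold=0.9):
    """Greedy linking (an assumed stand-in for cross-modal alignment):
    a mention joins the first entity whose seed vector is similar enough,
    otherwise it starts a new entity."""
    entities = []
    for mention in mentions:
        for entity in entities:
            if cosine(mention["vec"], entity["seed"]) >= threshold:
                entity["mentions"].append(mention)
                break
        else:
            entities.append({"seed": mention["vec"], "mentions": [mention]})
    return entities

def to_triples(entities):
    """Flatten linked entities into (subject, relation, object) triples."""
    triples = []
    for index, entity in enumerate(entities):
        node = f"entity_{index}"
        for mention in entity["mentions"]:
            relation = f"has_{mention['modality']}_mention"
            triples.append((node, relation, mention["label"]))
    return triples

# Toy mentions with made-up embeddings in a shared 3-dimensional space.
MENTIONS = [
    {"modality": "text", "label": "Dragon Boat Festival", "vec": [0.90, 0.10, 0.00]},
    {"modality": "image", "label": "boat_race.jpg", "vec": [0.88, 0.15, 0.02]},
    {"modality": "geo", "label": "riverside pier", "vec": [0.85, 0.20, 0.05]},
    {"modality": "audio", "label": "folk_song.wav", "vec": [0.10, 0.90, 0.10]},
]

if __name__ == "__main__":
    for triple in to_triples(link_mentions(MENTIONS)):
        print(triple)
```

In this toy setting the text, image and geographic mentions fall into one entity because their vectors point in nearly the same direction, while the audio mention forms its own entity. A real system would replace the greedy threshold with a learned alignment model and a proper graph store.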