This paper presents a multimodal conversational assistant, deployed as a web application, for exploring cultural heritage sites. Users can query contextually relevant tourism information through both text and images. At its core, a custom Retrieval-Augmented Generation (RAG) architecture retrieves relevant knowledge from a vector database and conditions a Large Language Model (LLM) on it to generate accurate and coherent responses. For visual queries, the assistant uses a You Only Look Once (YOLO) object detection model to recognize monuments in user-uploaded images, then provides concise descriptions and supports follow-up conversations grounded in the visual context. The object detection model achieves a mean Average Precision at an IoU threshold of 0.5 (mAP@50) of 0.995, while the RAG pipeline scores 0.96 in context recall, 0.98 in faithfulness, and 0.88 in factual correctness. This work highlights the potential of combining vision and language models to deliver reliable and engaging support for culturally informed tourism.
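The retrieve-then-condition flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy bag-of-words embedding in place of a real embedding model, an in-memory list in place of the vector database, and a plain string label in place of the YOLO detector's output. All function names and passages are hypothetical placeholders, and the LLM call itself is left out.

```python
import math
import re
from collections import Counter

# Placeholder knowledge base (hypothetical passages, not real tourism data).
PASSAGES = [
    "Temple A is a stone temple beside the river.",
    "Palace B has a museum that is open on weekdays.",
    "Stupa C is reached by a long staircase.",
]

def embed(text):
    """Toy embedding: lowercase term counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query (stand-in for a vector database lookup)."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages, detected_label=None, k=2):
    """Assemble the text that would condition the LLM.

    detected_label is assumed to be the monument name returned by an
    object detector for an uploaded image; it is None for text-only queries.
    """
    query = question if detected_label is None else f"{detected_label} {question}"
    context = retrieve(query, passages, k)
    lines = ["Answer using only the context below.", "Context:"]
    lines += [f"- {p}" for p in context]
    if detected_label is not None:
        lines.append(f"Detected monument: {detected_label}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_prompt("When is the museum open?", PASSAGES, k=1))
    print(build_prompt("How do I get there?", PASSAGES, detected_label="Stupa C", k=1))
```

In this sketch the detected label is prepended to the retrieval query, so a follow-up question such as "How do I get there?" is still matched against passages about the recognized monument. This is one plausible way to ground the conversation in visual context, not necessarily the one used in the paper.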
Ritu Ram Ojha
Tribhuvan University
Institute of Engineering
Ritu Ram Ojha (Wed,) studied this question.
synapsesocial.com/papers/68c1840e9b7b07f3a06106a1 — DOI: https://doi.org/10.36227/techrxiv.175691179.92356541/v1