This paper presents a multilingual AI pipeline designed to strengthen museum knowledge engineering by combining human-curated terminology, semantic consolidation, and LLM-enhanced document analysis. Developed as part of a 2025 collaboration between the Louvre Abu Dhabi (LAD), the Sorbonne Center for Artificial Intelligence (SCAI), Sorbonne Université, and Sorbonne University Abu Dhabi, the project addresses long-standing gaps in multilingual museum documentation across English, French, and Arabic. Challenges include terminology heterogeneity, limited availability of specialized Arabic resources, and inconsistencies in cross-lingual metadata, all of which constrain translation, cataloguing, and digitization workflows.

To address these issues, we introduce a three-module pipeline:

1. Multilingual Terminological Resource: a validated corpus of 400 entries and 68 bibliographic records, enriched with definitions, authoritative sources, and images.
2. Semantic Structuring and Termbase Integration: the consolidation of nearly 1,000 terms into the Louvre Abu Dhabi Termbase, following LAD's concept-based, metadata-rich model and hybrid prescriptive/descriptive methodology.
3. LLM-Enhanced OCR and Metadata Extraction: a document analysis workflow combining OCR engines with multimodal LLMs for transcription, post-correction, and structured artifact metadata extraction across all three languages.

Experiments demonstrate that curated terminology and semantic relations significantly improve the accuracy of OCR post-correction and LLM extraction. The gains are most pronounced for Arabic, where language-specific thresholds and a manually corrected gold standard of 294 pages were essential.
The resulting workflow provides a scalable methodological framework for multilingual museum documentation and a transferable blueprint for heritage institutions. It also demonstrates how concept-based terminological resources can function as effective semantic constraints for LLM-driven document analysis in low-resource multilingual settings.
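To make the idea of terminology acting as a constraint on OCR post-correction more concrete, the following is a minimal sketch, not the project's implementation. It assumes a toy termbase, a character-level similarity measure from Python's standard `difflib` module, and per-language acceptance thresholds. The names `TERMBASE`, `THRESHOLDS`, `correct_token`, and `post_correct`, as well as the threshold values and sample terms, are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical per-language similarity thresholds (illustrative values only,
# not taken from the paper). A lower Arabic threshold is an assumption.
THRESHOLDS = {"en": 0.85, "fr": 0.85, "ar": 0.75}

# Toy termbase: language code -> validated terms (placeholder entries).
TERMBASE = {
    "en": ["astrolabe", "ewer", "reliquary"],
    "fr": ["astrolabe", "aiguière", "reliquaire"],
    "ar": ["أسطرلاب", "إبريق"],
}


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()


def correct_token(token: str, lang: str) -> str:
    """Replace an OCR token with the closest termbase entry for `lang`,
    but only if the similarity reaches that language's threshold.
    Comparison is case-insensitive; the termbase spelling is returned."""
    best_term, best_score = token, 0.0
    for term in TERMBASE.get(lang, []):
        score = similarity(token.lower(), term.lower())
        if score > best_score:
            best_term, best_score = term, score
    # Unknown languages fall back to a threshold of 1.0 (exact match only).
    if best_score >= THRESHOLDS.get(lang, 1.0):
        return best_term
    return token


def post_correct(text: str, lang: str) -> str:
    """Apply termbase-constrained correction to whitespace-separated tokens."""
    return " ".join(correct_token(tok, lang) for tok in text.split())


if __name__ == "__main__":
    print(post_correct("an astrolahe museum", "en"))
```

In this sketch the misread token `astrolahe` is close enough to the termbase entry `astrolabe` to be corrected, while unrelated words such as `museum` are left unchanged. A real workflow would likely tune the thresholds against a gold standard and pass candidate terms to an LLM rather than rely on string similarity alone.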
Aldarmaki et al. studied this question.