March 3, 2026 | Open Access

Multilingual AI pipeline for museum knowledge and translation: Terminology curation, semantic structuring, and LLM-enhanced analysis

Key Points

Curated terminology and semantic relations significantly improved the accuracy of OCR post-correction and LLM-based metadata extraction.
The project developed a multilingual AI pipeline that combines terminology curation, semantic structuring, and LLM-enhanced document analysis across English, French, and Arabic.
Gains were strongest for Arabic, where language-specific thresholds and 294 manually corrected gold-standard pages were essential.
The approach provides a scalable framework for multilingual museum documentation that other heritage institutions can adapt.

Abstract

This paper presents a multilingual AI pipeline designed to strengthen museum knowledge engineering by combining human-curated terminology, semantic consolidation, and LLM-enhanced document analysis. Developed as part of a 2025 collaboration between the Louvre Abu Dhabi (LAD), the Sorbonne Center for Artificial Intelligence (SCAI), Sorbonne Université, and Sorbonne University Abu Dhabi, the project addresses long-standing gaps in multilingual museum documentation across English, French, and Arabic. Challenges include terminology heterogeneity, limited availability of specialized Arabic resources, and inconsistencies in cross-lingual metadata, all of which constrain translation, cataloguing, and digitization workflows.

To address these issues, we introduce a three-module pipeline:

1. Multilingual Terminological Resource: a validated corpus of 400 entries and 68 bibliographic records, enriched with definitions, authoritative sources, and images.
2. Semantic Structuring and Termbase Integration: the consolidation of nearly 1,000 terms into the Louvre Abu Dhabi Termbase, following LAD's concept-based, metadata-rich model and hybrid prescriptive/descriptive methodology.
3. LLM-Enhanced OCR and Metadata Extraction: a document analysis workflow combining OCR engines with multimodal LLMs for transcription, post-correction, and structured artifact metadata extraction across all three languages.

Experiments demonstrate that curated terminology and semantic relations significantly improve the accuracy of OCR post-correction and LLM extraction, especially for Arabic, where language-specific thresholds and manual gold-standard pages (294 corrected pages) were essential.
The resulting workflow provides a scalable methodological framework for multilingual museum documentation and a transferable blueprint for heritage institutions. It also demonstrates how concept-based terminological resources can function as effective semantic constraints for LLM-driven document analysis in low-resource multilingual settings.
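
The abstract does not specify how the termbase constrains OCR post-correction. As a purely illustrative sketch, one simple way to picture the idea is fuzzy matching of OCR tokens against curated terms, with a per-language similarity threshold. The function names, the threshold values, and the use of Python's `difflib` are all assumptions made for this example, not the authors' implementation.

```python
from difflib import SequenceMatcher

# Hypothetical per-language similarity thresholds (illustrative values only,
# not taken from the paper). A lower threshold accepts more aggressive fixes.
THRESHOLDS = {"en": 0.85, "fr": 0.85, "ar": 0.75}


def correct_token(token, terms, threshold):
    """Return the closest termbase entry if it is similar enough, else the token."""
    best, best_score = token, 0.0
    for term in terms:
        score = SequenceMatcher(None, token.lower(), term.lower()).ratio()
        if score > best_score:
            best, best_score = term, score
    return best if best_score >= threshold else token


def post_correct(text, terms, lang):
    """Correct each whitespace-separated OCR token against the curated terms."""
    threshold = THRESHOLDS[lang]
    return " ".join(correct_token(tok, terms, threshold) for tok in text.split())


if __name__ == "__main__":
    termbase = ["astrolabe", "ceramic"]  # stand-in for curated termbase entries
    print(post_correct("brass astrolahe", termbase, "en"))
```

In this sketch a misread token such as `astrolahe` is close enough to the curated term `astrolabe` to be replaced, while an unrelated word such as `brass` falls below the threshold and is left untouched. A real system would more plausibly pass the curated terms to the LLM as context and tune the thresholds against gold-standard pages.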
