• Offers a new vision for automated data harmonization in Dataspace (DS) systems.
• Introduces LLM-based methods for scalable DS ingestion of heterogeneous data sources.
• Presents a system with Harmonizer, Transformer, and Evaluator components for ingestion.
• Demonstrates an automated data ingestion prototype using LLM agents.
• Validates the system with a healthcare use case that harmonizes heterogeneous data sources.

Dataspaces (DS) enable stakeholders to collaborate on innovative, data-driven services by integrating data across domains. However, the realization and adoption of DS remain challenging due to domain-specific heterogeneity at the system, service, and data levels. While system- and service-level heterogeneity can often be addressed through standards, data-level heterogeneity, namely variation in data structures and semantics, remains an open problem. To ingest data into the DS effectively, the two communicating endpoints must correctly interpret each other’s data models. DS ecosystems therefore rely on “harmonization”, the process of generating a unified target data model from heterogeneous sources and transforming incoming data accordingly. Currently, harmonization and transformation are performed manually whenever a new data source is integrated. This is time-consuming, costly, and difficult to scale, posing a critical barrier to the realization and adoption of DS in practice. This study proposes a novel methodology for automated data harmonization during ingestion into DS ecosystems. The approach integrates harmonization, transformation, and human-in-the-loop evaluation within an automated system powered by modern LLM-based AI agents. These agents address data-level heterogeneity and generate harmonized target data models, a substantial departure from current manual data harmonization. The system is validated through a healthcare use case, demonstrating its practical feasibility for harmonization during data ingestion into the DS.
Overall, this work provides a foundational step toward seamless, efficient, and scalable data integration in DS. By automating data harmonization, it delivers substantial value to industrial digital solutions and to other domains where data heterogeneity persists, such as IoT and Big Data platforms.
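To make the harmonize, evaluate, and transform flow concrete, the following is a minimal sketch and not the system described in this work. It assumes a hard-coded synonym table (`SYNONYMS`) in place of the LLM agent, and every name in it (`Mapping`, `harmonize`, `evaluate`, `transform`, `ingest`, and the healthcare field names) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical synonym table standing in for the LLM agent's semantic matching.
# A real system would ask an LLM to propose these correspondences.
SYNONYMS = {
    "pat_id": "patient_id",
    "patientIdentifier": "patient_id",
    "hr": "heart_rate_bpm",
    "heartRate": "heart_rate_bpm",
}


@dataclass
class Mapping:
    """Proposed correspondence between one source schema and the target model."""
    source_name: str
    field_map: dict  # source field name -> target field name


def harmonize(source_name: str, sample_record: dict) -> Mapping:
    """Harmonizer (sketch): propose a source-to-target field mapping."""
    field_map = {key: SYNONYMS.get(key, key) for key in sample_record}
    return Mapping(source_name, field_map)


def evaluate(mapping: Mapping, approve: Callable[[Mapping], bool]) -> bool:
    """Evaluator (sketch): human-in-the-loop check, modeled as a callback."""
    return approve(mapping)


def transform(mapping: Mapping, record: dict) -> dict:
    """Transformer (sketch): rewrite one record into the target model."""
    return {mapping.field_map.get(key, key): value for key, value in record.items()}


def ingest(source_name: str, records: list, approve: Callable[[Mapping], bool]) -> list:
    """Run harmonize -> evaluate -> transform for one incoming data source."""
    if not records:
        return []
    mapping = harmonize(source_name, records[0])
    if not evaluate(mapping, approve):
        raise ValueError(f"mapping for {source_name!r} was rejected by the reviewer")
    return [transform(mapping, record) for record in records]


if __name__ == "__main__":
    clinic_a = [{"pat_id": 1, "hr": 72}]
    clinic_b = [{"patientIdentifier": 2, "heartRate": 80}]
    auto_approve = lambda mapping: True  # stand-in for a human reviewer
    print(ingest("clinic_a", clinic_a, auto_approve))
    print(ingest("clinic_b", clinic_b, auto_approve))
```

Both sources end up with the same target field names, which is the property harmonization is meant to provide. The approval callback marks the point where a human reviewer could reject a proposed mapping before any data is transformed.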
Singh et al. (Sun,) studied this question.