ABSTRACT. In the petabyte era, climate research deals with large and extremely large datasets on a daily basis. Filling in the metadata that accompanies these datasets is often challenging: it is time-consuming, error-prone, and frequently leaves records incomplete. Arguably, most researchers fill in only the minimal set of metadata required to publish their data (e.g. software, publication), mostly because of time constraints. Metadata fields are also filled inconsistently. The institution field, for example, sometimes contains an abbreviation and at other times the full name, and upper and lower case are mixed. Moreover, users do not always choose the same names for the same variables. In many cases, the metadata also falls short of the FAIR principles (findable, accessible, interoperable, reusable). In this talk, we present the idea of an automatic, AI-based approach to FAIR-compliant metadata for climate research that addresses these challenges. Based on an interdisciplinary collaboration within the Leibniz Science Campus “Digital Transformation of Research” (DiTraRe), we created a work plan that connects researchers from the climate domain with computer science experts and infrastructure providers (RADAR). Within this framework, we aim to develop a scalable infrastructure that leverages natural language processing (NLP), knowledge graphs, and large language models (LLMs) to support the harmonisation and semantic alignment of metadata in climate research repositories. Our output will be a curated, machine-actionable metadata set that can support both the integration of scientific data and downstream AI research. We aim to deliver not only technical tools but also sustainable resources for the community, including an openly accessible metadata set and methods for its continuous extension and reuse.
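To make the inconsistencies described above concrete, the following is a minimal sketch of rule-based harmonisation for two metadata fields. It is not the planned infrastructure, which is meant to rely on NLP, knowledge graphs, and LLMs. The field names `institution` and `variable`, the alias tables, and the institution "Example Climate Institute" are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical alias tables: normalised spelling -> canonical value.
# A real system would draw these from a curated vocabulary or knowledge graph.
INSTITUTION_ALIASES = {
    "eci": "Example Climate Institute",
    "example climate institute": "Example Climate Institute",
}
VARIABLE_ALIASES = {
    "pr": "precipitation",
    "precip": "precipitation",
    "precipitation": "precipitation",
}


def _key(value: str) -> str:
    """Collapse whitespace and case so spelling variants compare equal."""
    return re.sub(r"\s+", " ", value).strip().lower()


def harmonise(record: dict) -> dict:
    """Return a copy of `record` with known aliases mapped to canonical values.

    Unknown values are kept (only stripped of surrounding whitespace), so
    nothing is silently discarded.
    """
    out = dict(record)
    for field, aliases in (
        ("institution", INSTITUTION_ALIASES),
        ("variable", VARIABLE_ALIASES),
    ):
        value = record.get(field)
        if isinstance(value, str):
            out[field] = aliases.get(_key(value), value.strip())
    return out


if __name__ == "__main__":
    print(harmonise({"institution": "  ECI ", "variable": "Precip"}))
```

A lookup table of this kind only covers spellings that are already known, which is one reason the work plan targets semantic alignment methods rather than fixed rules.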
Bach et al. (Thu,) studied this question.