June 23, 2026 | Open Access

Designing an external metadata harvester for DSpace: Lessons from the Imec Repository

Key Points

This work aims to develop a resilient metadata harvester for institutional repositories, enhancing data quality and operational sustainability.
Developed an independent crawler microservice that harvests publication metadata from Web of Science and Crossref.
Implemented DOI-based deduplication and controlled metadata merging, with all communication to DSpace going through its REST API.
Deployed the crawler as a Linux service with an incremental harvesting strategy.
Improved metadata quality and coverage, which in turn supports findability and interoperability.
Enabled independent scaling, configuration, and failure isolation for the harvester, reducing operational risk and keeping it upgrade-safe across DSpace versions.
Supported FAIR principles by improving machine-actionability and reducing long-term maintenance and preservation risk.

Abstract

Institutional repositories increasingly depend on external scholarly data sources to improve coverage, timeliness, and metadata quality. However, tight coupling between harvesting logic and repository platforms often introduces operational risk, complicates upgrades, and limits sustainability. This presentation describes the design and implementation of an independent crawler microservice developed for the Imec institutional repository. The service harvests publication metadata from Web of Science and Crossref, performs deduplication and controlled metadata merging, and communicates with DSpace exclusively through its REST API. By fully decoupling crawling and enrichment logic from the repository core, the solution enables independent scaling, configuration, and failure isolation, while remaining upgrade-safe across DSpace versions. We will present the crawler's architecture, deployment as a Linux service, incremental harvesting strategy, DOI-based deduplication, and a transparent metadata precedence model balancing licensed and open sources. The approach directly supports FAIR principles by improving findability, interoperability, and machine-actionability, while reducing long-term maintenance and preservation risk. The session concludes with lessons learned, design trade-offs, and recommendations for repository developers seeking resilient, future-proof integrations with emerging scholarly infrastructure.
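
To make the DOI-based deduplication and the metadata precedence model more concrete, here is a minimal Python sketch. It is not the Imec implementation: the record shape, the function names `normalize_doi` and `merge_records`, and the per-field precedence table are all hypothetical, chosen only to illustrate one way such a model could work. The sketch assumes that records from both sources carry a DOI, that DOIs are compared case-insensitively after stripping resolver prefixes, and that precedence is decided field by field.

```python
from __future__ import annotations

import re

# Hypothetical precedence table (an assumption, not taken from the paper):
# for each field, the sources are tried in order and the first non-empty value wins.
PRECEDENCE = {
    "title": ["crossref", "wos"],
    "abstract": ["wos", "crossref"],
}
DEFAULT_ORDER = ["wos", "crossref"]  # used for fields not listed above

_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_doi(raw: str) -> str:
    """Strip resolver or 'doi:' prefixes and lowercase the result.

    DOIs are case-insensitive, so lowercasing gives a stable deduplication key.
    """
    return _DOI_PREFIX.sub("", raw.strip()).strip().lower()


def merge_records(records: list[dict]) -> dict[str, dict]:
    """Group harvested records by normalized DOI and merge them field by field.

    Each input record is assumed to look like
    {"source": "wos" | "crossref", "doi": str, "fields": {name: value}}.
    If one source yields the same DOI twice, the later record replaces the earlier.
    """
    by_doi: dict[str, dict[str, dict]] = {}
    for rec in records:
        key = normalize_doi(rec["doi"])
        by_doi.setdefault(key, {})[rec["source"]] = rec["fields"]

    merged: dict[str, dict] = {}
    for doi, per_source in by_doi.items():
        field_names = {name for fields in per_source.values() for name in fields}
        out = {}
        for name in sorted(field_names):
            for source in PRECEDENCE.get(name, DEFAULT_ORDER):
                value = per_source.get(source, {}).get(name)
                if value:
                    # Keep the winning source so the merge stays transparent.
                    out[name] = {"value": value, "source": source}
                    break
        merged[doi] = out
    return merged


harvested = [
    {"source": "wos", "doi": "https://doi.org/10.1000/ABC",
     "fields": {"title": "Title from WoS", "abstract": "Abstract from WoS"}},
    {"source": "crossref", "doi": "doi:10.1000/abc",
     "fields": {"title": "Title from Crossref", "license": "CC-BY"}},
]
result = merge_records(harvested)
```

In this sketch the two harvested records collapse into a single entry keyed by `10.1000/abc`, and each merged field records which source supplied it. Storing the winning source next to the value is one possible way to keep the precedence decisions auditable; the merged entry could then be sent to DSpace through its REST API.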
