What question did this study set out to answer?

The aim is to create a processed, accessible version of the OpenAIRE citation graph dataset to facilitate community use.

February 14, 2026 | Open Access

A distilled OpenAIRE citation graph

Key Points

Downscaled the citation graph from a 2.5TB dataset to 32GB.
Provided multiple CSV file formats for nodes and edges of the graph.
Compressed files to enhance ease of sharing and processing.
Successfully created a manageable OpenAIRE graph while preserving the full structure.
Facilitated community access through simplified formats for further data manipulation.
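
As a rough illustration of the kind of manipulation a plain node/edge CSV layout allows, the sketch below counts incoming citations per node while streaming an xz-compressed edge list. The two-column row layout (citing id, then cited id), the optional header row, and the function name are assumptions made for illustration; the actual column layout of the edge file should be checked before use.

```python
import csv
import lzma
from collections import Counter


def count_incoming_citations(edges_path, has_header=False):
    """Stream an xz-compressed edge CSV and count edges pointing at each node.

    Assumption (not confirmed by the dataset description): each row has the
    form `citing_id,cited_id`.
    """
    in_degree = Counter()
    # Text-mode lzma.open decompresses on the fly, so the file is never
    # fully expanded on disk or held in memory at once.
    with lzma.open(edges_path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row if the file has one
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed rows
            in_degree[row[1]] += 1
    return in_degree
```

Only the counter is kept in memory, so the cost grows with the number of distinct cited nodes rather than with the number of edges.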

Abstract

The OpenAIRE graph contains a large citation graph dataset, with over 200 million publications and over 2 billion citations. The current graph is available as a dump with metadata which, uncompressed, totals 2.5TB. This makes it hard to process on conventional computers. To make this network more available to the community, we provide a processed OpenAIRE network which is downscaled to 32GB while preserving the full graph structure. In addition, we offer the processed data in a very simple format, which allows straightforward further manipulation.

The files are:

publications.csv - The nodes in the citation graph
citations.csv - The edges in the citation graph
publication_large.csv - The nodes, but with several fields for additional features

All files are compressed (.xz) files.

The fields in publication_large.csv:

Field        Explanation                                                 Memory usage (GB)
nodeId       Unique internal identifier for the node (publication)       2
openaireId   Identifier assigned by the OpenAIRE platform                18
doi          Digital Object Identifier of the publication                13
title        Title of the publication                                    28
authors      List of authors associated with the publication             20
description  Abstract or short description of the publication            192
date         Date when the publication was published                     11
container    Journal, conference, or repository where it was published   13
citations    Number of times the publication has been cited              2
language     Language in which the publication is written                10

The pipeline used to produce the data is found in the pipeline.tar.xz file. It is also found here: link
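
Because the per-field memory figures differ so widely, it can help to load only the fields a task needs. The following is a minimal sketch, not part of the published pipeline: it sums the tabulated memory usage for a chosen set of fields and streams just those columns from the compressed file. It assumes the CSV begins with a header row using the field names listed in the abstract; the helper names are hypothetical.

```python
import csv
import lzma

# Per-field memory usage in GB, copied from the field listing in the abstract.
FIELD_MEMORY_GB = {
    "nodeId": 2, "openaireId": 18, "doi": 13, "title": 28, "authors": 20,
    "description": 192, "date": 11, "container": 13, "citations": 2,
    "language": 10,
}


def estimated_memory_gb(fields):
    """Sum the tabulated memory usage for the chosen fields."""
    return sum(FIELD_MEMORY_GB[name] for name in fields)


def iter_selected_fields(path, fields):
    """Yield one dict per publication, keeping only the requested fields.

    Assumption: the first row of the CSV is a header naming the columns.
    """
    # Decompress and parse row by row instead of loading the whole file.
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            yield {name: row[name] for name in fields}
```

For example, `estimated_memory_gb(["nodeId", "doi", "citations"])` gives 17, whereas all ten fields together sum to 309, of which the description field alone accounts for 192.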

Cite This Study

Skarding et al. studied this question.

synapsesocial.com/papers/699011522ccff479cfe57d08 https://doi.org/10.5281/zenodo.18402099
