What question did this study set out to answer?

To evaluate existing tools and methods for SSH citations and propose improvements specific to the field.

January 14, 2026 · Open Access

D4.2 Report on Existing Datasets, Methods and Tools for the SSH Citation Index

Key Points

Systematic analysis of bibliographic reference processing specific to SSH
Benchmarking large language models under SSH-typical conditions
Review of datasets and tools for reference extraction and citation linking
Analysis of existing citation intent classification models and their limitations
Demonstrated challenges in SSH citation practices compared to STEM approaches
Identified large language models as viable alternatives for citation processing
Proposed the creation of a new benchmark dataset tailored for SSH citation linking
Outlined the need for SSH-specific taxonomies to enhance citation intent classification

Abstract

This deliverable D4.2 presents the main results of GRAPHIA's work on understanding, evaluating, and preparing the technical foundations for the SSH Citation Index modules. Its central contribution is to show, through systematic analysis and empirical evaluation, that bibliographic reference processing in the Social Sciences and Humanities (SSH) raises challenges that are not adequately addressed by approaches developed primarily for STEM disciplines, and to identify concrete ways in which GRAPHIA can respond to these challenges.

For reference extraction and parsing, the deliverable combines a detailed review of existing datasets and tools with an extensive benchmarking of large language models under SSH-typical conditions. By testing multilingual documents, footnote-heavy articles, and heterogeneous layouts, the work goes beyond standard journal-centric evaluations and provides clear evidence of where traditional supervised pipelines perform well and where they break down. It also shows that LLM-based methods can be a flexible and competitive alternative for SSH material when they are carefully guided through segmentation strategies and structured outputs. These results directly inform future technical choices for the SSH Citation Index, particularly with respect to robustness, scalability, and multilingual coverage.

In the area of citation intent classification, the deliverable shows that existing models and datasets, largely developed for narrowly defined STEM domains, rest on assumptions that do not align well with SSH writing practices. SSH citations are often argumentative, interpretative, and distributed across longer stretches of text, which challenges prevailing annotation schemes and modelling strategies. By analysing these limitations and outlining prospects for SSH-specific taxonomies and datasets, the deliverable reframes citation intent classification as an opportunity to enrich SSH citations with semantic information that better reflects disciplinary practices, rather than as a simple task of transferring existing methods.

For citation linking, the deliverable highlights why this task is particularly demanding in SSH: references frequently point to books and chapters without DOIs, appear in multiple languages, and rely on inconsistent or incomplete metadata. The analysis of existing resources indicates that no current benchmark adequately captures this complexity. As a result, the deliverable motivates and specifies the creation of a new SSH-oriented benchmark dataset, providing a concrete roadmap for evaluating linking methods against modern open bibliographic infrastructures. This work is essential for improving the reliability and coverage of citation links within the SSH Knowledge Graph.

Overall, the deliverable establishes a coherent methodological baseline for the SSH Citation Index modules within GRAPHIA. By combining state-of-the-art review, targeted benchmarking, and forward-looking dataset design, it supports informed technical decisions in subsequent work packages and lays the groundwork for sustainable, SSH-sensitive citation services that can be integrated into the broader GRAPHIA infrastructure.

Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the Agency. Neither the European Union nor the granting authority can be held responsible for them.
