March 3, 2026

Identifying and classifying software mentions in full text scholarly documents

Key Points

Large language models extract software mentions significantly more accurately than rule-based and conventional NLP methods, which supports reproducibility.
LLMs exhibit strong contextual reasoning, which helps them identify software mentions in heterogeneous academic texts.
The evaluation used three gold-standard corpora to compare prompting strategies and configurations against established baselines.
The paper discusses implications for reproducibility and open science, including better integration of software into the scholarly record.

Abstract

Software is central to modern science, yet references to it in scholarly articles are often incomplete or inconsistent, hindering reproducibility and reuse. Existing methods for mining software mentions, such as rule-based and conventional NLP approaches, remain limited in scalability and robustness. The emergence of large language models (LLMs) offers new opportunities for improving this task. LLMs exhibit strong contextual reasoning and adaptability, making them well suited to extracting software mentions from heterogeneous academic texts. In this paper, we evaluate several LLM-based approaches using three gold-standard corpora, comparing prompting strategies and configurations against established baselines. Our contributions are threefold: (1) we provide the first systematic evaluation of LLMs for software mention extraction, (2) we analyse their strengths and weaknesses relative to prior techniques, and (3) we discuss implications for reproducibility and open science. Results show that LLMs significantly improve extraction accuracy and adaptability, advancing efforts to integrate software into the scholarly record.
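
To make the comparison against gold-standard corpora concrete, the following is a minimal sketch of how extracted software mentions could be scored against reference annotations. It is an illustration only: the abstract does not state which metrics or matching rules the authors used, so the exact-match, case-insensitive comparison and the function name `score_mentions` are assumptions, and the example names are invented.

```python
def score_mentions(predicted, gold):
    """Return (precision, recall, F1) for one document.

    Assumption: a predicted mention counts as correct only when it matches
    a gold mention exactly after trimming whitespace and lowercasing.
    The paper's actual matching criteria may differ.
    """
    pred = {m.strip().lower() for m in predicted}
    ref = {m.strip().lower() for m in gold}

    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: mentions returned by a model vs. gold annotations.
model_output = ["SPSS", "R", "Excel"]
gold_annotations = ["spss", "R", "MATLAB", "Python"]
precision, recall, f1 = score_mentions(model_output, gold_annotations)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Scoring over sets ignores how often a name is repeated within a document. A span-level evaluation that counts every occurrence would need character offsets, which this sketch leaves out.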

Authors

David Pride

The Open University

Matteo Guenci

Martin Dočekal

Brno University of Technology

Institutions

University of Bologna

The Open University

Brno University of Technology
