August 11, 2025

Responsible Uses of Large Language Models for Research Evaluation

Key Points

LLMs can predict article quality ratings that correlate more strongly with expert judgements than citation-based indicators do in most fields, suggesting a potential shift in research evaluation methods.
The study identifies issues with LLMs, including authors potentially manipulating abstracts for better scores and the opaque nature of LLM scoring mechanisms.
While citations reflect tangible and permanent contributions to the scientific record, albeit of variable value, LLM scores lack a clear link to scholarly progress, raising concerns about their reliability.
The development of newer LLMs could result in significantly varying scores, suggesting a need for caution in relying solely on such indicators for research evaluation.

Abstract

Although research evaluators and scientometricians have promoted the message of responsible bibliometrics through initiatives like the Leiden Manifesto, these initiatives do not mention Large Language Models (LLMs). LLMs can now make useful quality predictions for journal articles, giving values that correlate more strongly with expert judgements than do citation-based indicators in most fields. This has created the possibility that they could supplement or even replace citation-based indicators for some applications. As tested so far, LLMs predict the quality rating that a human expert would give a paper. They do this by reading the quality level descriptions and then processing the article title and abstract. This raises multiple new issues that the Leiden Manifesto does not address. First, authors might try to trick LLMs into giving high scores by crafting LLM-friendly abstracts. Second, LLMs incorporate billions of parameters, so their scores are opaque. Third, it is not clear which factors most influence LLM scores, so their biases are unknown. Fourth, whilst citations reflect tangible and permanent contributions to the scientific record, albeit of variable value, LLM-based predictions do not clearly link to progress. Fifth, LLM scores are ephemeral in the sense that newer LLMs may give substantially different scores and rankings.
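The comparisons described in the abstract (LLM scores against expert judgements, citation-based indicators against expert judgements, and one LLM's rankings against a newer LLM's) are typically expressed as rank correlations. The following is a minimal sketch of a Spearman rank correlation in plain Python. The function names and all of the numbers are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical illustration data for six articles (NOT from the study).
expert_scores = [1, 2, 2, 3, 4, 4]              # expert quality ratings
llm_scores_old = [1.2, 2.1, 1.9, 3.0, 3.4, 3.8]  # scores from one LLM
llm_scores_new = [2.0, 1.5, 2.5, 2.8, 3.9, 3.1]  # scores from a newer LLM
citation_counts = [3, 40, 5, 12, 9, 60]          # citation counts

print("old LLM vs experts:  ", round(spearman(llm_scores_old, expert_scores), 3))
print("citations vs experts:", round(spearman(citation_counts, expert_scores), 3))
print("old LLM vs new LLM:  ", round(spearman(llm_scores_old, llm_scores_new), 3))
```

A low correlation in the last comparison would correspond to the fifth concern: two LLMs can rank the same articles differently even if each correlates reasonably well with expert judgement.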
