
April 29, 2026 | Open Access

Robustness and authorship bias of large language models in scientific abstracts scoring

Key Points

This research aims to evaluate the robustness and authorship bias of large language models in scoring scientific abstracts.
The authors conducted three controlled experiments on abstracts from five arXiv categories.
Author information was introduced as a perturbation: fake authoritative CVs, fake non-authoritative CVs (both generated by another LLM), and attribution to famous, well-known authors.
Four LLMs were evaluated on 10 scoring criteria, comparing baseline scores against each perturbed condition.
LLMs exhibited systematic authorship bias when author context was provided.
Introducing author information changed scores relative to the baseline, indicating a lack of robustness.
The results underscore the need for further research to ensure fairness and transparency in automated scoring systems.

Abstract

The use of large language models (LLMs) is increasingly explored in the field of automatic text scoring, particularly in tasks such as automatic essay scoring (AES) and abstract screening for systematic reviews. While prior research has focused on evaluating the accuracy of these models, their robustness and potential biases remain underexplored. To address this gap, we investigated robustness to novel author information and authorship bias of four LLMs in scientific abstracts scoring on 10 evaluation criteria. We conducted three controlled experiments on abstracts from five arXiv categories, comparing baseline scores against conditions where author information was introduced as a perturbation. These perturbations included: associating abstracts with fake authoritative CVs, associating them with fake non-authoritative CVs (both generated by another LLM), and associating abstracts with famous, well-known authors. The results of our controlled analyses illustrate that LLMs lack robustness and exhibit systematic authorship bias when author context is provided. These findings highlight the need for further research to ensure fairness and transparency in automated scoring systems.
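
The baseline-versus-perturbation comparison described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `score_shifts`, the criterion names, and the toy scores are hypothetical and are not taken from the paper, which may use a different analysis.

```python
from statistics import mean


def score_shifts(baseline, perturbed):
    """Mean per-criterion shift (perturbed minus baseline).

    Both arguments map a criterion name to a list of scores, one score
    per abstract, in the same abstract order. A shift near zero suggests
    the scorer is robust to the added author information; a consistent
    positive or negative shift suggests authorship bias.
    """
    shifts = {}
    for criterion, base_scores in baseline.items():
        pert_scores = perturbed[criterion]
        if len(base_scores) != len(pert_scores):
            raise ValueError(f"score count mismatch for {criterion!r}")
        # Pair scores abstract by abstract, then average the differences.
        shifts[criterion] = mean(p - b for p, b in zip(pert_scores, base_scores))
    return shifts


if __name__ == "__main__":
    # Hypothetical toy scores for three abstracts (not results from the study).
    baseline = {"clarity": [6, 7, 5], "novelty": [5, 6, 4]}
    with_authoritative_cv = {"clarity": [7, 8, 6], "novelty": [6, 6, 5]}
    for name, shift in score_shifts(baseline, with_authoritative_cv).items():
        print(f"{name}: {shift:+.2f}")
```

Running the same comparison for each perturbation condition (authoritative CV, non-authoritative CV, famous author) would give one shift per criterion per condition, which is one simple way to see whether the direction of the change depends on the perceived status of the author.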
