Large language models (LLMs) are increasingly explored for automatic text scoring, particularly in tasks such as automatic essay scoring (AES) and abstract screening for systematic reviews. While prior research has focused on evaluating the accuracy of these models, their robustness and potential biases remain underexplored. To address this gap, we investigated the robustness of four LLMs to newly introduced author information, and their authorship bias, when scoring scientific abstracts on 10 evaluation criteria. We conducted three controlled experiments on abstracts from five arXiv categories, comparing baseline scores against conditions in which author information was introduced as a perturbation. The three perturbations were: pairing abstracts with fake authoritative CVs, pairing them with fake non-authoritative CVs (both sets of CVs generated by another LLM), and attributing abstracts to famous, well-known authors. The results of our controlled analyses show that the tested LLMs lack robustness and exhibit systematic authorship bias when author context is provided. These findings highlight the need for further research to ensure fairness and transparency in automated scoring systems.
Sajeva et al. studied this question.